
Guest blog post by Manav Pietro

These days, customer experience, data, and brand strategy are gaining importance in marketing. Customer experience and data analysis play a bigger role, and marketers are spending more time on broader business strategy instead of focusing only on advertising. The infographic below, titled “Let’s Talk about Customer Experience”, takes a closer look.

According to a Gartner study, most companies are expected to compete predominantly on the basis of customer experience in the coming years. Delivering an excellent customer experience is the new battleground for brands. A good customer experience encourages customer loyalty; if a brand fails to provide one, its customers are likely to go elsewhere. Businesses are therefore spending more time and effort developing customer experience strategies, and the role of customer experience officer has garnered much more attention as a result.

The key role of a customer experience officer is to oversee marketing communications, internal relations, community relations, investor relations, and the various other interactions between an organization and its customers. To learn more about customer experience, along with related facts and figures, please refer to the infographic.

Read more…

Originally posted on Data Science Central

With the enormous amounts of data generated in the technology era, data scientist has become an increasingly needed vocation. The US just named its first Chief Data Scientist, and all the top companies are hiring their own. Yet because the profession is so new, many people are not aware of the career possibilities it offers. Those in the field can look forward to a promising career and excellent compensation. To learn more about what you can do with a career as a data scientist, check out this infographic created by Rutgers University’s Online Master of Information.


Data Scientist Career Trends

People interested in pursuing a career in this line of work should be prepared to go the distance in terms of education. Among the current crop of data specialists, nearly half (48%) have a PhD, a further 44% have earned a master’s degree, and only 8% hold just a bachelor’s degree. A solid academic background clearly helps immensely, both in gaining the knowledge required for this career and in impressing the gatekeepers at various companies.

Common Certifications

Getting certified is another good strategy in creating an excellent resume that will draw offers from the best names in the industry. There are four common certifications that are currently available. These are the Certified Analytics Professional (CAP), the Cloudera Certified Professional: Data Scientist (CCP-DS), the EMC: Data Science Associate (EMCDSA), and the SAS Certified Predictive Modeler. Each of these is geared towards specific competencies. Learn more about them to find out the best ones to take for the desired career path.

Job Experience

The explosion of data is a fairly recent phenomenon, aided by digital computing and the Internet. Massive amounts of information are now collected every day, and companies are trying to make sense of it. The pioneers have been around for a while, but 76% of the scientists working with data have been on the job for only four years or less. It’s a good time to enter the field for those who want to be trailblazers in a fresh and exciting area of technology.

Common Responsibilities

There are plenty of issues that are yet to be cleared up with data possibly providing a clear answer once and for all. In this field, practitioners are often relied upon to conduct research on open-ended industry and organization questions. They may also extract large volumes of data from various sources which is a non-trivial task. Then they must clean and remove irrelevant information to make their collections usable. 

Once everything has been primed, the scientists then begin their analysis to check for weaknesses, trends and opportunities. The clues are all in their hands. They simply have to look for the markers and make intelligent connections. Those who are into development can create algorithms that will solve problems and build new automation tools. After they have compiled all of their findings, they must then effectively communicate the results to the non-technical members of the management. 

Expected Salary

Data scientists are well-compensated for their technical skills, and their average earnings depend on years of experience in the field. Entry-level workers with less than 5 years under their belt can expect to earn around $92,000 annually. With almost a decade in data analysis, a person can take home $109,000 per year. Experienced scientists with nearly two decades in this career get about $121,000, and the most respected pioneers earn $145,000 a year or more. The median salary was $116,840 in 2016.

Career Possibilities

There are several industries with high demand for data scientists. It should be no surprise that the largest employer is the technology sector, at about 41%. This is followed by 13% in marketing, 11% in corporate settings, 9% in consulting, 7% in health care, and 6% in financial services. The rest are scattered across government, academia, retail, and gaming.

Job Roles

At their chosen workplace, data scientists often take on more than one job role. Around 55.9% act as researchers for their company, mining data for valuable information. Another common task is business management, with 40.2% saying they work in this capacity. Many are also asked to use their skills as developers (36.5%) or creatives (36.3%).

Career Profile of US Chief Data Scientist

Dr. DJ Patil studied mathematics as an undergraduate at the University of California, San Diego before earning his PhD in Applied Mathematics at the University of Maryland, where he used his skills to improve NOAA's numerical weather forecasting using their open datasets. He has written numerous publications highlighting important applications of data science; in fact, he co-coined the term "data scientist." His efforts have earned global recognition, including an award at the 2014 World Economic Forum. In 2015, he was appointed US Chief Data Scientist.

His work experience has enabled him to apply his skills across various industries. He has been Vice President of Product at RelateIQ, Head of Data Products and Chief Security Officer at LinkedIn, Data Scientist in Residence at Greylock Partners, Director of Strategy at eBay, Assistant Research Scientist at the University of Maryland, and AAAS Policy Fellow at the Department of Defense.

Job Growth and Demand

Projections for this career are rosy, with well-known publications hailing it as the next big thing. Glassdoor named it the Top Job in America for 2016, and the Harvard Business Review called it the Sexiest Job of the 21st Century. The good news for those thinking about starting on this path is that there’s plenty of room for new people: nearly 80% of data scientists report a shortage in their field, and they need reinforcements given the volume of work they have to do. In fact, the projected growth over the next decade is 11%, higher than the 7% estimated for all occupations.

Expert Tips

According to experts, interested individuals should do three things to succeed in the field: spend time learning effective analytics communication, consider relocation, and interact with other data scientists. The first is crucial because this is highly technical work whose results need to be understood by non-technical managers. The second is a practical move, with 75% of available jobs located on the East and West Coasts. The third is advice common to all fields: widen your network, learn from your peers, and create future opportunities.

Read original article here.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

13 Great Data Science Infographics

Originally posted on Data Science Central

Most of these infographics are tutorials covering various topics in big data, machine learning, visualization, data science, Hadoop, R, or Python, typically intended for beginners. Some are cheat sheets that make nice summaries for professionals with years of experience. Some, popular a while back (you will find one example here), were designed as periodic tables.

For Geeks 

For Business People

Infographics Repositories

Read more…

Twitter Analytics using Tweepsmap

Guest blog post by Salman Khan

This morning I saw #tweepsmap on my Twitter feed and decided to check it out. Tweepsmap is a neat tool that can analyze any Twitter account from a social network perspective. It can create interactive maps showing where the followers of a Twitter account reside, segment followers, and even show who unfollowed you!

Here is my Followers map generated by country.

You can create the followers map based on city and state as well.

Tweepsmap also provides demographic information such as languages, occupation, and gender, but it relies on the Twitter user having entered this information in their profile.

There is also a hashtag and keyword analyzer that reports on the most prolific tweeters, locations of tweets, tweets vs. retweets, and so on. I used their free report, which is limited to a maximum of 100 tweets, to analyze the trending hashtag #BeautyAndTheBeast. For some reason, the hashtag is really popular in Brazil: of the 100 tweets, 26 were from Brazil and 20 from the USA, and 3 out of 5 of the top influencers with the most followers were tweeting in Portuguese. Other visualizations included tweets vs. retweet numbers and the distribution reach of the tweeters. I was even able to make the report public, so you can check it out here. Remember it only analyzes 100 tweets, so don’t draw any conclusions from it!


If you are doing research on social media, or are a business that wants to learn more about competitors and customers, Tweepsmap helps you analyze specific Twitter accounts as well! Of course, we all know there is no such thing as a free lunch, so this is a paid feature.

From what I saw by tinkering with the pricing calculator on their page, analyzing a Twitter account with more than 2.5M followers costs a flat fee of $5K. I tried a few Twitter accounts to see how much each would cost based on the number of followers and found that the cost per follower was $0.002. So if you wanted Twitter data on Hans Rosling, it would cost you $642, as he has 320,956 followers (642/320,956 = 0.002).
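That pricing structure, inferred only from my tinkering with the calculator and not from any published Tweepsmap price list, can be sketched in a few lines of Python:

```python
COST_PER_FOLLOWER = 0.002    # inferred from trying a few accounts
FLAT_FEE_THRESHOLD = 2_500_000
FLAT_FEE = 5_000             # flat fee above 2.5M followers

def estimated_cost(followers):
    """Estimated report price in USD for an account of this size."""
    if followers > FLAT_FEE_THRESHOLD:
        return FLAT_FEE
    return followers * COST_PER_FOLLOWER

print(round(estimated_cost(320_956)))   # Hans Rosling's follower count -> 642
```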


Overall, this looks like a neat tool for getting started with Twitter data and using that information to maximize the returns on your tweets. I have only mentioned a few of their tools above; they have other features, like Best Time to Tweet, which analyzes your audience, Twitter history, time zones, and so on to predict when you will get the most out of your tweet. Check out their website for more info here.

Read more…

Originally posted on Data Science Central

Written by sought-after speaker, designer, and researcher Stephanie D. H. Evergreen, Effective Data Visualization shows readers how to create Excel charts and graphs that best communicate data findings. This comprehensive how-to guide functions as a set of blueprints—supported by research and the author’s extensive experience with clients in industries all over the world—for conveying data in an impactful way. Delivered in Evergreen’s humorous and approachable style, the book covers the spectrum of graph types available beyond the default options, how to determine which one most appropriately fits specific data stories, and easy steps for making the chosen graph in Excel.

The book is available here.


Read more…

Guest blog post by Max Wegner

What’s the first thing you think of when you hear the phrase “artificial intelligence”? Perhaps it’s the HAL 9000 from 2001: A Space Odyssey, or maybe it’s chess Grandmaster Garry Kasparov losing to IBM’s Deep Blue supercomputer. While those are indeed examples of artificial intelligence, examples of AI in the real world of today are a bit more mundane and a whole lot less sinister.

In fact, many of us use AI, in one form or another, in our everyday lives. The personal assistant on your smartphone that helps you locate information, the facial recognition software on Facebook photos, and even the gesture control on your favourite video game are all examples of practical AI applications. Rather than being a part of a dystopian world view in which the machines take over, current AI makes our lives a whole lot more convenient by carrying out simple tasks for us.

What’s more, there’s a lot of money flowing into a lot of companies working on AI developments. This means that in the near future, we could see even more practical uses for AI, from smart robots to smart drones and more.

To give you a better understanding of the current state of AI, our friends have put together this helpful artificial intelligence infographic. It gives you the full rundown, from categories to geography to finances. Check it out, and you’ll see why AI is so essential to our everyday lives, and why its future looks so bright.

Read more…

Originally posted on Data Science Central

This article was posted by Bethany Cartwright. Bethany is the blog team's Data Visualization Intern. She spends most of her time creating infographics and other visuals for blog posts.

Whether you’re writing a blog post, putting together a presentation, or working on a full-length report, using data in your content marketing strategy is a must. Data enhances your arguments by making your writing more compelling, gives your readers context, and provides support for your claims.

That being said, if you’re not a data scientist yourself, it can be difficult to know where to look for data and how best to present it once you’ve got it. To help, below you'll find the tools and resources you need to source credible data and create some stunning visualizations.

Resources for Uncovering Credible Data

When looking for data, it’s important to find numbers that not only look good, but are also credible and reliable.

The following resources will point you in the direction of some credible sources to get you started, but don’t forget to fact-check everything you come across. Always ask yourself: Is this data original, reliable, current, and comprehensive?

Tools for Creating Data Visualizations

Now that you know where to find credible data, it’s time to start thinking about how you’re going to display that data in a way that works for your audience.

At its core, data visualization is the process of turning basic facts and figures into a digestible image, whether it’s a chart, graph, timeline, map, infographic, or other type of visual.

While understanding the theory behind data visualization is one thing, you also need the tools and resources to make digital data visualization possible. Below we’ve collected 10 powerful tools for you to browse, bookmark, or download to make designing data visuals even easier for your business.

To check all this information, click here. For more articles about data visualization, click here.


Read more…

Guest blog post by Mike Waldron

Originally posted on Data Science Central

This blog was originally published on the AYLIEN Text Analysis blog

We wanted to gather and analyze news content in order to look for similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.

Note: For a more technical, in-depth, and interactive representation of this project, check out the Jupyter notebook we created. This includes sample code and more in-depth descriptions of our approach.

The Approach

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and tSNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists

Creating Parse Trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar. For example, with a simple sentence like “The cat sat on the mat”, a parse tree might look like this:

Thankfully, parsing our extracted headlines isn’t too difficult. We used the Pattern library for Python to parse the headlines and generate our parse trees.


In total, we gathered about 700 article headlines for each journalist using the AYLIEN News API, which we then analyzed using Python. If you’d like to give it a go yourself, you can grab the Pickled data files directly from the GitHub repository (link), or use the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them together with some basic information about the headline in the same Python object.

Then using a sequence similarity metric, we compared all of these headlines two by two, to build a similarity matrix.
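Our exact metric lives in the notebook, but as an illustration, a simple sequence similarity over the chunk types of two headlines can be computed with Python's standard library. The chunk-tag sequences below are hypothetical stand-ins, not values from the actual dataset:

```python
from difflib import SequenceMatcher

# hypothetical chunk-type sequences extracted from two parse trees
akin = ["NP", "VP", "PP", "NP"]   # e.g. a market-update style headline
carly = ["NP", "VP", "NP"]        # e.g. a celebrity style headline

def similarity(a, b):
    """Ratio of matching chunk types between two headlines, 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity(akin, akin))               # identical structure -> 1.0
print(round(similarity(akin, carly), 2))
```

Applying a function like this to every pair of headlines yields the similarity matrix described above.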


To visualize headline similarities for Akin, we generated a 2D scatter plot, with the hope that similarly structured headlines would group close together on the graph.

To achieve this, we first reduced the dimensionality of our similarity matrix using tSNE, then applied K-Means clustering to find groups of similar headlines. We also used a nice viz library, as outlined below:

  • tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Plotted the actual chart using Bokeh
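The full pipeline is in our notebook, but the tSNE and K-Means steps can be sketched with scikit-learn. This is a minimal sketch using a small random stand-in for the real 700×700 similarity matrix, not our actual data:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# random stand-in for the real 700x700 headline-similarity matrix
rng = np.random.default_rng(0)
sim = rng.random((50, 50))
sim = (sim + sim.T) / 2          # make it symmetric
np.fill_diagonal(sim, 1.0)       # a headline is identical to itself

# tSNE with a precomputed metric expects distances, not similarities
dist = 1.0 - sim
coords = TSNE(n_components=2, metric="precomputed", init="random",
              perplexity=10, random_state=0).fit_transform(dist)

# label the 2D points by K-Means cluster (5 clusters, as in the post)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
print(coords.shape)
```

The `coords` array can then be fed straight to a Bokeh scatter plot, colored by `labels`.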

The chart above shows a number of dense groups of headlines, as well as some sparse ones. Each dot on the graph represents a headline, as you can see when you hover over one in the interactive version. Similar titles are grouped together quite cleanly. Some of the stand-out groups are:

  • The circular group left of center typically consists of short, snappy stock-update headlines such as “Viacom is crashing”.
  • The large circular group on the top right consists mostly of announcement-style headlines in “Here come the…” formats.
  • The small green circular group towards the bottom left contains headlines that use the same phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.

Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.

Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors. The yellow dots represent our Celebrity focused author and the blue our finance guy.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also covers some of the headlines from the second author such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.

Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them quite interesting, there were some areas we thought could be improved. Some of the weaknesses of our approach, and ways to improve them, are:

 - Using entire parse trees instead of just the chunk types

 - Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistic-aware one too)

 - Better pre-processing to identify and normalize Named Entities, etc.

Next up..

In our next post, we’re going to study the correlations between various headline structures and some external metrics like number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy Celebrity style headlines would probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or want to see the sample code we used head over to our Jupyter notebook.

Read more…

Guest blog post by Jeff Pettiross

For almost as long as we have been writing, we’ve been putting meaning into maps, charts, and graphs. Some 1,300 years ago, Chinese astronomers recorded the position of the stars and the shapes of the constellations. The Dunhuang star maps are the oldest preserved atlas of the sky:

More than 500 years ago, the residents of the Marshall Islands learned to navigate the surrounding waters by canoe in the daytime—without the aid of stars. These master rowers learned to recognize the feel of the currents reflecting off the nearby islands. They visualized their insights on maps made of sticks, rocks, and shells.

In the 1800s, Florence Nightingale used charts to explain to government officials how treatable diseases were killing more soldiers in the Crimean War than battle wounds. She knew that pictures would tell a more powerful story than numbers alone:

Why Visualized Data Is So Powerful

Since long before spreadsheets and graphing software, we have communicated data through pictures. But we’ve only begun, in the last half-century, to understand why visualizations are such effective tools for seeing and understanding data.

It starts with the part of your brain called the visual cortex. Located near the bony lump at the back of your skull, it processes input from your eyes. Thanks to the visual cortex, our sense of sight provides information much faster than the other senses. We actually begin to process what we see before we think about it.

This is sound from an evolutionary perspective. The early human who had to stop and think, “Hmm, is that a jaguar sprinting toward me?” probably didn’t survive to pass on their genes. There is a biological imperative for our sense of sight to override cognition—in this case, for us to pay sharp attention to movement in our peripheral vision.

Today, our sight is more likely to save us on a busy street than on the savannah. Moving cars and blinking lights activate the same peripheral attention, helping us navigate a complicated visual environment. We see other cues on the street, too. Bright orange traffic cones mark hazards. Signs call out places, directions, and warnings. Vertical stripes on the street indicate lanes while horizontal lines indicate stop lines.

We have designed a rich, visual system that drivers can comprehend quickly, thanks to perceptual psychology. Our visual cortex is attuned to color hues (like safety orange), position (signs placed above road), and line orientation (lanes versus stop lines). Research has identified other visual features. Size, clustering, and shape also help us perceive our environment almost immediately.

What This Means for Us Today

Fortunately, our offices and homes tend to be safer than the savannah or the highway. Regardless, our lightning-quick sense of vision jumps into action even when we read email, tweets, or websites. And that right there is why data visualization communicates so powerfully and immediately: It takes advantage of these visual features, too.

A line graph immediately reveals upward or downward changes, thanks to the orientation of each segment. The axes of the graph use position to communicate values in relationship to each other. If there are multiple, colored lines, the color hue lets us rapidly tell the lines apart, no matter how many times they cross. Bar charts, maps with symbols, area graphs—these all use the visual superhighway in our brains to communicate meaning.

The early pioneers of data visualization were led by their intuition to use visual features like position, clustering, and hue. The longevity of those works is a testament to their power.

We now have software to help us visualize data and to turn tables of facts and figures into meaningful insights. That means anyone, even non-experts, can explore data in a way that wasn’t possible even 20 years ago. We can, all of us, analyze the world’s growing volume of data, spot trends and outliers, and make data-driven decisions.

Today, we don’t just have charts and graphs; we have the science behind them. We have started to unlock the principles of perception and cognition so we can apply them in new ways and in various combinations. A scatter plot can leverage position, hue, and size to visualize data. Its data points can interactively filter related charts, allowing the user to shift perspectives in their analysis by simply clicking on a point. Animating transitions as users pivot from one idea to the next brings previously hidden differences to the foreground. We’re building on the intuition of the pioneers and the conclusions of science to make analysis faster, easier, and more intuitive.

When humanity unlocked the science behind fire and magnets, we learned to harness chemistry and physics in new ways. And we revolutionized the world with steam engines and electrical generators.

Humanity is now at the dawn of a new revolution, and intuitive tools are putting the beautiful science of data visualization into the hands of millions of users.

I’m excited to see where you take all of us next.

Note: This post first appeared in VentureBeat.

Read more…

Investigating Airport Connectedness

Guest blog post by SupStat

Contributed by Sricharan Maddineni, a neuroscientist with a strong passion and talent for data science. He took the NYC Data Science Academy 12-week boot camp program between January 11th and April 1st, 2016. This post is based on his second project, due in the fourth week of the program and posted on February 16th. He acquired publicly available transportation data, consulted social media, and visualized the economic and business insights he found in it.

Why Are Airports Important?


Aviation infrastructure has been a bedrock of the United States economy and culture for many decades, and it was the first instrument through which we connected with the world. Before the invention of flight, humans were inexorably confined by the immenseness of Earth's oceans.

All the disdain and unpleasantries we endure on flights are quickly forgotten once we safely land at our destinations and realize we have just been transported to a new place on our vast planet. Every time I have flown and landed in a new country or city, I am overwhelmed with feelings of how beautiful our world is and how much I wish I could visit every corner of our planet. My love of aviation has led me to investigate the connectedness of United States airports and the passenger-disparity between the developed and developing countries.

The App

The interactive map can be used as a tool to investigate the connectedness of the US airports. Users can choose from a list of airports including LAX, JFK, IAD and more to visualize the connections out of that airport. The 'Airport Connections' table shows us the combinations of connections by Airline Carrier. For example, we can see that American Airlines (AA) had 8058 flights out of LAX to JFK (2009 dataset). The 'Carriers' table shows us the total flights out of LAX by American Airlines (76,670).

If we select Hartsfield-Jackson Atlanta International, we see that it is the most connected airport in the United States. *Please note that I am not plotting all possible connections, just major airport connections and only within the United States (the map would be filled solid if I plotted all connections!). The size of the airport bubble is calculated by the number of connections. Therefore, all large bubbles are international airports, and smaller bubbles are regional/domestic airports.

I also plotted Voronoi tessellations between the airports, using one nearest neighbor, to show the area differences between airports on the East Coast, the West Coast, and in the Midwest. The largest polygons are found in the Midwest, because airports there are far apart in all directions. These airports are generally more connected as well, since they connect the east and west coasts (see Denver International or Salt Lake City International). Clicking on a Voronoi polygon brings up the nearest airport within that area.

Why is it important for countries to improve their airport infrastructure?

Looking at the motion/bubble chart, we observe that developing countries travel horizontally whereas developed countries travel vertically. This indicates that developed countries' populations have remained steady while their numbers of air travelers have risen. On the flip side, developing countries have seen their populations boom while their numbers of air travelers have remained stagnant.

Most importantly, countries moving upward show noticeable gains in GDP whereas countries moving horizontally show minimal gains over the last four decades (GDP is represented by the size of the bubble). We can also notice that airline passenger counts plunge during recessions for first world countries but remain comparatively steady for developing countries (1980, 2000, 2009). We can interpret this to mean that developing countries are not as connected to the rest of the world since their economies are unaffected by global economic crises.

Passenger Counts during weekends and Holidays

The calendar heatmap shows us the Daily flight count in the United States. We can recognize that airlines operate significantly fewer flights on Saturdays and National Holidays such as July 4th and Thanksgiving. The days leading up to and after National Holidays show an increase in flights as expected. Looking carefully, you can also notice there are fewer flights on Tuesdays and Wednesdays, and there are more flights during the summer season.

If you select a day on the calendar, a table shows the flight counts of the top 20 airline carriers for that day. Southwest, American Airlines, SkyWest, and Delta seem to operate the most flights in the United States.


The Data

1. Interactive Map

I utilized comprehensive datasets provided by the United States Department of Transportation and Open Data by Socrata that allowed me to map airport connections in the United States. The first airport dataset included airport locations (city/state) and their latitude and longitude degrees, and the second dataset included the airport connections (LAX - JFK, LAX-SFO, ...). First, I used these datasets to calculate the size of the airport based on how many connections each had.

2. Motion Chart

The second analysis was done using the airline passenger, population, and GDP numbers for the world's countries over the last 45 years. Most of the work here was in transforming the three datasets provided by the World Bank from wide to long. See the code below.
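The author did this reshaping in R; as a language-agnostic illustration, the same wide-to-long transformation looks like this in plain Python (the countries and numbers below are invented, and the World Bank layout is simplified to one column per year):

```python
# Hypothetical World Bank-style wide table: one column per year
wide = [
    {"country": "Brazil", "1970": 2.1, "1971": 2.3},
    {"country": "India",  "1970": 3.4, "1971": 3.9},
]

# Long form: one (country, year, value) row per observation
long_rows = [
    {"country": row["country"], "year": year, "passengers": value}
    for row in wide
    for year, value in row.items()
    if year != "country"
]

print(long_rows[0])
```

Long format is what motion-chart libraries generally expect: one row per country per year.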

3. Calendar Chart

Lastly, I used the TranStats database to obtain the daily flight counts by airline carrier for the years 2004-2007. Some transformation was done to create two separate data frames: flight counts per day and flight counts per carrier. While trying to calculate flight counts by day, I tried this code:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, month) %>% summarise(sum = n()) 

I knew there was an error by looking at the resulting heatmap, but I didn't realize it was showing me a cumulative sum by month rather than the daily flight count, so I turned to Twitter to see if I could get help diagnosing my problem. I tweeted Jeff Weis, who appeared as the aviation analyst on CNN during the Malaysian Airlines MH370 disappearance, and he caught my mistake! After he had pushed me in the right direction, I corrected my code to:

group_by(UniqueCarrier, date) %>% summarise(count = n()) 
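To see why the first attempt misled the heatmap, here is the month-total versus daily-count distinction sketched in Python with a few made-up flight records (the author's actual code is the R above):

```python
from collections import Counter

# Hypothetical flight records: (carrier, date) rows
flights = [
    ("WN", "2007-01-01"), ("WN", "2007-01-01"), ("WN", "2007-01-02"),
    ("AA", "2007-01-01"), ("AA", "2007-02-01"),
]

# Grouping by month collapses a whole month into one number ...
by_month = Counter((carrier, date[:7]) for carrier, date in flights)
# ... while grouping by full date yields the daily counts the heatmap needs.
by_day = Counter(flights)

print(by_month[("WN", "2007-01")])   # 3 -- a monthly total, not a daily count
print(by_day[("WN", "2007-01-01")])  # 2
```

Plotting the monthly totals on a daily calendar is exactly the "cumulative sum by month" symptom described above.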

The Code

Creating Voronoi Polygons

Connection Lines

The second step was creating the line connections between the airports. To do this, I used the addPolylines function in Leaflet to add connecting lines between airports filtered by user input. input$Input1 catches the user-selected airport and subsets the dataset to all origin airports that equal the selected airport. The gcIntermediate function (from the geosphere package) makes those lines curved.

Calendar json capture

The calendar chart required two parameters: whichdatevar, which reads the date column, and numvar, which plots the value for each day on the calendar. Then I utilized a gvis.listener.jscode method to capture the user-selected date and filter the dataset for the table.

To experience Sricharan Maddineni's interactive Shiny App, see the original post.

Read more…

Guest blog post by Irina Papuc

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic.

What is Machine Learning?

So what exactly is “machine learning” anyway? ML is actually a lot of things. The field is quite vast and is expanding rapidly, being continually partitioned and sub-partitioned ad nauseam into different sub-specialties and types of machine learning.

There are some basic common threads, however, and the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959: “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”

And more recently, in 1997, Tom Mitchell of Carnegie Mellon University gave a “well-posed” definition that has proven more useful to engineering types: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P).

The highly complex nature of many real-world problems, though, often means that inventing specialized algorithms that will solve them perfectly every time is impractical, if not impossible. Examples of machine learning problems include, “Is this cancer?”, “What is the market value of this house?”, “Which of these people are good friends with each other?”, “Will this rocket engine explode on take off?”, “Will this person like this movie?”, “Who is this?”, “What did you say?”, and “How do you fly this thing?”. All of these problems are excellent targets for an ML project, and in fact ML has been applied to each of them with great success.

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

  • Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.

  • Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships therein.

We will primarily focus on supervised learning here, but the end of the article includes a brief discussion of unsupervised learning with some links for those who are interested in pursuing the topic further.

Supervised Machine Learning

In the majority of supervised learning applications, the ultimate goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”). “Learning” consists of using sophisticated mathematical algorithms to optimize this function so that, given input data x about a certain domain (say, square footage of a house), it will accurately predict some interesting value h(x) (say, market price for said house).

In practice, x almost always represents multiple data points. So, for example, a housing price predictor might take not only square-footage (x1) but also number of bedrooms (x2), number of bathrooms (x3), number of floors (x4), year built (x5), zip code (x6), and so forth. Determining which inputs to use is an important part of ML design. However, for the sake of explanation, it is easiest to assume a single input value is used.

So let’s say our simple predictor has this form:

h(x) = θ₀ + θ₁x

where θ₀ and θ₁ are constants. Our goal is to find the perfect values of θ₀ and θ₁ to make our predictor work as well as possible.

Optimizing the predictor h(x) is done using training examples. For each training example, we have an input value x_train, for which a corresponding output, y, is known in advance. For each example, we find the difference between the known, correct value y, and our predicted value h(x_train). With enough training examples, these differences give us a useful way to measure the “wrongness” of h(x). We can then tweak h(x) by tweaking the values of θ₀ and θ₁ to make it “less wrong”. This process is repeated over and over until the system has converged on the best values for θ₀ and θ₁. In this way, the predictor becomes trained, and is ready to do some real-world predicting.

A Simple Machine Learning Example

We stick to simple problems in this post for the sake of illustration, but the reason ML exists is because, in the real world, the problems are much more complex. On this flat screen we can draw you a picture of, at most, a three-dimensional data set, but ML problems commonly deal with data with millions of dimensions, and very complex predictor functions. ML solves problems that cannot be solved by numerical means alone.

With that in mind, let’s look at a simple example. Say we have the following training data, wherein company employees have rated their satisfaction on a scale of 1 to 100:

First, notice that the data is a little noisy. That is, while we can see that there is a pattern to it (i.e. employee satisfaction tends to go up as salary goes up), it does not all fit neatly on a straight line. This will always be the case with real-world data (and we absolutely want to train our machine using real-world data!). So then how can we train a machine to perfectly predict an employee’s level of satisfaction? The answer, of course, is that we can’t. The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

It is somewhat reminiscent of the famous statement by British mathematician and professor of statistics George E. P. Box that “all models are wrong, but some are useful”.

ML builds heavily on statistics. For example, when we train our machine to learn, we have to give it a statistically significant random sample as training data. If the training set is not random, we run the risk of the machine learning patterns that aren’t actually there. And if the training set is too small (see law of large numbers), we won’t learn enough and may even reach inaccurate conclusions. For example, attempting to predict company-wide satisfaction patterns based on data from upper management alone would likely be error-prone.

With this understanding, let's give our machine the data we've been given above and have it learn. First we have to initialize our predictor h(x) with some reasonable values of θ₀ and θ₁. Now our predictor looks like this when placed over our training set:

If we ask this predictor for the satisfaction of an employee making $60k, it would predict a rating of 27:

It’s obvious that this was a terrible guess and that this machine doesn’t know very much.

So now, let’s give this predictor all the salaries from our training set, and take the differences between the resulting predicted satisfaction ratings and the actual satisfaction ratings of the corresponding employees. If we perform a little mathematical wizardry (which I will describe shortly), we can calculate, with very high certainty, that values of 13.12 for and 0.61 for are going to give us a better predictor.

And if we repeat this process, say 1500 times, our predictor will end up looking like this:

At this point, if we repeat the process, we will find that θ₀ and θ₁ won't change by any appreciable amount anymore, and thus we see that the system has converged. If we haven't made any mistakes, this means we've found the optimal predictor. Accordingly, if we now ask the machine again for the satisfaction rating of the employee who makes $60k, it will predict a rating of roughly 60.

Now we’re getting somewhere.

A Note on Complexity

The above example is technically a simple problem of univariate linear regression, which in reality can be solved by deriving a simple normal equation and skipping this “tuning” process altogether. However, consider a predictor that looks like this:

This function takes input in four dimensions and has a variety of polynomial terms. Deriving a normal equation for this function is a significant challenge. Many modern machine learning problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients. Predicting how an organism’s genome will be expressed, or what the climate will be like in fifty years, are examples of such complex problems.

Fortunately, the iterative approach taken by ML systems is much more resilient in the face of such complexity. Instead of using brute force, a machine learning system “feels its way” to the answer. For big problems, this works much better. While this doesn’t mean that ML can solve all arbitrarily complex problems (it can’t), it does make for an incredibly flexible and powerful tool.

Gradient Descent - Minimizing “Wrongness”

Let’s take a closer look at how this iterative process works. In the above example, how do we make sure and are getting better with each step, and not worse? The answer lies in our “measurement of wrongness” alluded to previously, along with a little calculus.

The wrongness measure is known as the cost function (a.k.a. loss function), J(θ). The input θ represents all of the coefficients we are using in our predictor. So in our case, θ is really the pair θ₀ and θ₁. J(θ₀, θ₁) gives us a mathematical measurement of how wrong our predictor is when it uses the given values of θ₀ and θ₁.

The choice of the cost function is another important piece of an ML program. In different contexts, being “wrong” can mean very different things. In our employee satisfaction example, the well-established standard is the linear least squares function:

J(θ₀, θ₁) = (1 / 2m) · Σᵢ (h(xᵢ) − yᵢ)²

With least squares, the penalty for a bad guess goes up quadratically with the difference between the guess and the correct answer, so it acts as a very “strict” measurement of wrongness. The cost function computes an average penalty over all of the training examples.
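A minimal sketch of that cost function in Python, using the common 1/2m averaging convention (the toy data below is invented for the illustration):

```python
# Mean squared error cost J(θ0, θ1) for a linear predictor h(x) = θ0 + θ1·x,
# averaged over the m training examples (divided by 2m by convention).
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data: a perfect fit gives zero cost; a bad guess is penalized quadratically.
xs, ys = [1, 2, 3], [2, 4, 6]
print(cost(0.0, 2.0, xs, ys))  # 0.0 -- an exact fit
print(cost(0.0, 1.0, xs, ys))  # errors 1, 2, 3 -> (1 + 4 + 9) / 6
```

The quadratic penalty is what makes this a "strict" measurement: doubling an error quadruples its contribution to the cost.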

So now we see that our goal is to find θ₀ and θ₁ for our predictor h(x) such that our cost function J(θ₀, θ₁) is as small as possible. We call on the power of calculus to accomplish this.

Consider the following plot of a cost function for some particular ML problem:

Here we can see the cost associated with different values of θ₀ and θ₁. We can see the graph has a slight bowl to its shape. The bottom of the bowl represents the lowest cost our predictor can give us based on the given training data. The goal is to “roll down the hill”, and find the θ₀ and θ₁ corresponding to this point.

This is where calculus comes into this machine learning tutorial. For the sake of keeping this explanation manageable, I won't write out the equations here, but essentially what we do is take the gradient of J(θ₀, θ₁), which is the pair of derivatives of J(θ₀, θ₁) (one over θ₀ and one over θ₁). The gradient will be different for every different value of θ₀ and θ₁, and tells us what the “slope of the hill” is and, in particular, “which way is down”, for these particular θs. For example, when we plug our current values of θ into the gradient, it may tell us that adding a little to θ₀ and subtracting a little from θ₁ will take us in the direction of the cost function-valley floor. Therefore, we add a little to θ₀, and subtract a little from θ₁, and voilà! We have completed one round of our learning algorithm. Our updated predictor, h(x) = θ₀ + θ₁x, will return better predictions than before. Our machine is now a little bit smarter.

This process of alternating between calculating the current gradient and updating the θs from the results is known as gradient descent.
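A bare-bones sketch of one such descent, in Python rather than the article's notation; the learning rate, toy data, and iteration count are arbitrary choices for the illustration, not values from the tutorial:

```python
# One gradient-descent step for h(x) = θ0 + θ1·x with a squared-error cost:
# move θ0 and θ1 a little way "downhill" along the negative gradient.
def gradient_step(theta0, theta1, xs, ys, lr=0.01):
    m = len(xs)
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                              # dJ/dθ0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dθ1
    return theta0 - lr * grad0, theta1 - lr * grad1

# Repeat the step many times and the pair converges toward the best-fit line.
theta0, theta1 = 0.0, 0.0
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # underlying truth: y = 2x
for _ in range(5000):
    theta0, theta1 = gradient_step(theta0, theta1, xs, ys, lr=0.05)
print(round(theta1, 2))  # close to 2, the true slope
```

Nothing here ever solves for the optimum directly; the system "feels its way" there, which is exactly why the approach scales to problems where a closed-form solution is impractical.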

That covers the basic theory underlying the majority of supervised Machine Learning systems. But the basic concepts can be applied in a variety of different ways, depending on the problem at hand.

Classification Problems

Under supervised ML, two major subcategories are:

  • Regression machine learning systems: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.

  • Classification machine learning systems: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this cookie meet our quality standards?”, and so on.

As it turns out, the underlying theory is more or less the same. The major differences are the design of the predictor h(x) and the design of the cost function J(θ).

Our examples so far have focused on regression problems, so let’s now also take a look at a classification example.

Here are the results of a cookie quality testing study, where the training examples have all been labeled as either “good cookie” (y = 1) in blue or “bad cookie” (y = 0) in red.

In classification, a regression predictor is not very useful. What we usually want is a predictor that makes a guess somewhere between 0 and 1. In a cookie quality classifier, a prediction of 1 would represent a very confident guess that the cookie is perfect and utterly mouthwatering. A prediction of 0 represents high confidence that the cookie is an embarrassment to the cookie industry. Values falling within this range represent less confidence, so we might design our system such that prediction of 0.6 means “Man, that’s a tough call, but I’m gonna go with yes, you can sell that cookie,” while a value exactly in the middle, at 0.5, might represent complete uncertainty. This isn’t always how confidence is distributed in a classifier but it’s a very common design and works for purposes of our illustration.

It turns out there’s a nice function that captures this behavior well. It’s called the sigmoid function, g(z), and it looks something like this:

z is some representation of our inputs and coefficients, such as:

z = θ₀ + θ₁x

so that our predictor becomes:

h(x) = g(θ₀ + θ₁x)

Notice that the sigmoid function transforms our output into the range between 0 and 1.
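A minimal sketch of the sigmoid in Python (the sample inputs are arbitrary):

```python
import math

# The sigmoid squashes any real z into (0, 1), so g(θ0 + θ1·x) can be read
# as a confidence that the example belongs to the "yes" class.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5  -- complete uncertainty
print(sigmoid(6))   # ≈ 0.998 -- a confident "yes"
print(sigmoid(-6))  # ≈ 0.002 -- a confident "no"
```

Large positive z saturates toward 1 and large negative z toward 0, which is exactly the confidence behavior described above.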

The logic behind the design of the cost function is also different in classification. Again we ask “what does it mean for a guess to be wrong?” and this time a very good rule of thumb is that if the correct guess was 0 and we guessed 1, then we were completely and utterly wrong, and vice-versa. Since you can’t be more wrong than absolutely wrong, the penalty in this case is enormous. Alternatively if the correct guess was 0 and we guessed 0, our cost function should not add any cost for each time this happens. If the guess was right, but we weren’t completely confident (e.g. y = 1, but h(x) = 0.8), this should come with a small cost, and if our guess was wrong but we weren’t completely confident (e.g. y = 1 but h(x) = 0.3), this should come with some significant cost, but not as much as if we were completely wrong.

This behavior is captured by the log function, such that:

cost = −log(h(x)) if y = 1
cost = −log(1 − h(x)) if y = 0

Again, the cost function J(θ) gives us the average cost over all of our training examples.
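The per-example penalty described above can be sketched as follows (the probability values are illustrative):

```python
import math

# Per-example log loss: an enormous penalty for a confident wrong answer,
# a small penalty for a right-but-unconfident one.
def log_loss(y, h):
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(log_loss(1, 0.8))  # right but not fully confident -> small cost (~0.22)
print(log_loss(1, 0.3))  # on the wrong side -> significant cost (~1.20)
# log_loss(1, h) grows without bound as h -> 0:
# "you can't be more wrong than absolutely wrong"
```

Averaging this quantity over the training set gives the classification cost that gradient descent then minimizes.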

So here we’ve described how the predictor h(x) and the cost function differ between regression and classification, but gradient descent still works fine.

A classification predictor can be visualized by drawing the boundary line; i.e., the barrier where the prediction changes from a “yes” (a prediction greater than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system, our cookie data can generate a classification boundary that looks like this:

Now that’s a machine that knows a thing or two about cookies!

An Introduction to Neural Networks

No discussion of ML would be complete without at least mentioning neural networks. Not only do neural nets offer an extremely powerful tool to solve very tough problems, but they also offer fascinating hints at the workings of our own brains, and intriguing possibilities for one day creating truly intelligent machines.

Neural networks are well suited to machine learning problems where the number of inputs is gigantic. The computational cost of handling such a problem is just too overwhelming for the types of systems we’ve discussed above. As it turns out, however, neural networks can be effectively tuned using techniques that are strikingly similar to gradient descent in principle.

A thorough discussion of neural networks is beyond the scope of this tutorial, but I recommend checking out our previous post on the subject.

Unsupervised Machine Learning

Unsupervised learning typically is tasked with finding relationships within data. There are no training examples used in this process. Instead, the system is given a set of data and tasked with finding patterns and correlations therein. A good example is identifying close-knit groups of friends in social network data.

The algorithms used to do this are very different from those used for supervised learning, and the topic merits its own post. However, for something to chew on in the meantime, take a look at clustering algorithms such as k-means, and also look into dimensionality reduction systems such as principal component analysis. Our prior post on big data discusses a number of these topics in more detail as well.
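For a feel of how k-means works, here is a deliberately minimal one-dimensional sketch; real implementations handle many dimensions, random initialization, and convergence checks, and the data below is invented:

```python
# Bare-bones k-means on 1-D data: alternate between assigning points to
# their nearest centroid and moving each centroid to its cluster's mean.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else centroids[i]
                     for i, ps in clusters.items()]
    return sorted(centroids)

# Two obvious groups, around 1 and around 10
centres = kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], centroids=[0.0, 5.0])
print(centres)  # ≈ [1.0, 10.0]
```

No labels are ever supplied; the structure (two clusters) is discovered from the data alone, which is the defining trait of unsupervised learning.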


We’ve covered much of the basic theory underlying the field of Machine Learning here, but of course, we have only barely scratched the surface.

Keep in mind that to really apply the theories contained in this introduction to real-life machine learning examples, a much deeper understanding of the topics discussed herein is necessary. There are many subtleties and pitfalls in ML, and many ways to be led astray by what appears to be a perfectly well-tuned thinking machine. Almost every part of the basic theory can be played with and altered endlessly, and the results are often fascinating. Many grow into whole new fields of study that are better suited to particular problems.

Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing problems, as well as open up whole new worlds of opportunity. The demand for ML engineers is only going to continue to grow, offering incredible chances to be a part of something big. I hope you will consider getting in on the action!

This article was originally published in Toptal.

Read more…

Guest blog post by Chris Atwood

Recently, I rediscovered a TED Talk by David McCandless, a data journalist, called “The beauty of data visualization.” It’s a great reminder of how charts (though scary to many) can help you tell an actionable story about a topic in a way that bullet points alone usually cannot. If you have not seen the talk, I recommend you take a look for some inspiration about visualizing big ideas.


In any social media report you make for the brass, there are several types of data charts to help summarize the performance of your social media channels; the most common ones are bar charts, pie/donut charts and line graphs. They are tried and true but often overused, and are not always the best way to visualize the data to then inform and justify your strategic decisions. Below are some less common charts to help you tell the story about your social media strategy’s ROI.


For our examples here, we’ll primarily be examining a brand’s Facebook page for different types of analyses on its owned post performance.


Scatter plots

Figure 1: Total engagement vs total reach, colored by post type (Facebook Insights)

What they are: Scatter plots measure two variables against each other to help users determine whether a correlation or relationship exists between them.


Why they’re useful:  One of the most powerful aspects of a scatter plot is its ability to show nonlinear relationships between variables. They also help users get a sense of the big picture. In the example above, we’re looking for any observable correlations between total engagement (Y axis) and total reach (X axis) that can guide this Facebook page’s strategy. The individual dots are colored by the post type — status update (green), photo (blue) or video (red).


This scatter plot shows that engagement and reach have a direct relationship for photo posts, because the points form a fairly clear, straight line from the bottom left to the upper right. For other types of posts, the relationships are less clear, although it can be noted that video posts have extremely high reach even though engagement is typically low.
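One way to back up a visual read like this is to compute the Pearson correlation coefficient for each post type. A self-contained sketch with invented reach/engagement numbers standing in for the Facebook Insights export:

```python
# Pearson correlation quantifies the linear relationship a scatter plot shows:
# +1 is a perfect rising line, -1 a perfect falling line, 0 no linear relation.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-photo-post numbers: total reach (x) and total engagement (y)
reach      = [1200, 3400, 5600, 8000, 9900]
engagement = [40, 110, 180, 260, 330]
print(pearson(reach, engagement))  # close to 1: a strong linear relationship
```

Note that a high coefficient only measures linear association; the plot itself is still needed to spot the nonlinear patterns scatter plots are prized for.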


Box plots

Figure 2: Total reach benchmark by post type (Facebook Insights)


What they are: Box plots show the statistical distribution of different categories in your data, and let you compare them against one another and establish a benchmark for a certain variable. They are not commonly used because they’re not always pretty, and sometimes can be a bit confusing to read without the right context.


Why they’re useful: Box plots are excellent ways to display key performance indicators. Each category (with more than one post) will show a series of lines and rectangles; the box and whisker show what’s called the interquartile range (IQR). When you look at all the posts, you can split the values up into groups called quartiles or percentiles based on the distribution of the values. You can use the median or the value of the second quartile as a benchmark for “average” performance.


In this example, we’re once again looking at different post types on a brand’s Facebook page, and seeing what the total reach is like for each. For videos (red), you can see that the lower boundary for the reach is higher than the majority of photo posts, and that it doesn’t have any outliers. Photos, however, tell a different story. The first quartile is very short, while the fourth quartile is much longer. Since most of the posts fall above the second quartile, you know that many of these posts are performing above average. The dots above the whisker indicate outliers — i.e., these posts do not fall within the normal distribution. You should take a closer look at outliers to see what you can learn based on what they have in common (seasonality/timing, imagery, topic, audience targeting, or word choices).

Heat maps

Figure 3: Average total engagement per day by post type (Facebook Insights)


What they are: Heat maps take two categories of data and compare a single quantitative variable across them (average total reach, average total engagement, etc.). They are a great way to determine, for example, which posts earn the highest engagement or impressions, on average, on a given day.


Why they’re useful: The difference in the shade in colors shows how values in each column are different from each other. If the shades are all light, there is not a large difference in the values from category to category, versus if there are light colors and darker colors in a column, the values are very different from each other (more interesting!).


You could run a similar analysis to see what times  of day your posts get the highest engagement or reach, and find the answer to the classic question, “When should I post for the highest results?” You can also track competitors this way, to see how their content performs throughout the day or on particular days of the week. You can time your own posts around when you think shared audiences may be paying less attention to competitors, or make a splash during times with the best performance.


In the above example, you can see that three post types from a brand's Facebook page have been categorized by their average total engagement on a given day of the week. Based on the chart, photos do not vary much from day to day. Looking closer at the data from the previous box plot, we know that photo posts are the most common type and make up a large share of the data set; we can conclude that users are accustomed to seeing those posts, so they perform about the same day to day. We also see that video posts either perform far above or far below average, and that Thursday appears to be the best day for this brand to post videos.

Tree maps

Figure 4: Average total engagement by content pillar and post type (Facebook Insights)


What they are: Tree maps display hierarchical, qualitative information, like a tree that grows from a single trunk and branches into many leaves. Tree maps typically have three main components that tell you what's going on: the size of each rectangle, its relative color, and where it sits in the hierarchy.


Why they’re useful: Tree maps are a fantastic way to get a high-level look at your social data and figure out where you want to dig in for further analysis. In this example, we’re able to compare the average total engagement between different post types, broken out by content pillar.

For our brand’s Facebook page, we have trellised the data by post type (figure 4); in other words, we created a visualization that comprises three smaller visualizations, so we can see how the post type impacts the average total engagement for each content pillar. It answers the question, “Do my videos in category X perform differently than my photos in the same category?” You can also see that the rectangles vary in size from content pillar to content pillar; they  are sized by the number of posts in each subset. Finally, they are colored by the average total engagement for that content pillar’s subset of the post type. The darker the color, the higher the engagement.


We immediately learn that posts in the status trellis aren't performing anywhere near the other post types (it only has one post), and that photos have the greatest number of content pillars, or the greatest variety in topic. You can see from the visualization that you want to spend more of your energy digging into why posts in the Timely, Education and Event categories perform well in both photos and videos.


TL;DR: Better Presentations are made with Better Charts

In your next analysis, you shouldn’t disregard the tried and true bar charts, pie graphs and line charts. However, these four different visualizations may offer a more succinct way to summarize your data and help you explain the performance of your campaigns. They’ll also make your reports and wrapups look distinctive when they’re used correctly. Although there are other chart types that are also useful for making better analyses and presentations, the ones discussed here are fairly simple to put together and nearly all of them can be put together in Microsoft Excel or visualization/analysis software such as TIBCO's Spotfire. 

Read more…

Top 5 graph visualisation tools

Data visualisation is the process of displaying data in visual formats such as charts, graphs or maps. This method is commonly used to extract more meaning from a snapshot of data that, with other approaches, might require sorting through piles of spreadsheets and great quantities of reports. With the amount of data growing rapidly, it is more important than ever to interpret all of this information correctly and quickly to make well-informed business decisions.

Graph visualisation takes a fairly similar approach, just with more diverse and complex sets of data. A graph is a representation of objects (nodes), some of which are connected by links (edges). Graph visualisation is the process of displaying this data graphically to maximise readability and allow users to gain more insight.
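As a minimal illustration of the node-and-link structure these tools render, a graph can be represented as an adjacency list (the nodes here are placeholders):

```python
# A graph is just nodes plus links; an adjacency list is the simplest
# machine representation of that structure.
graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
}

# A node's degree (its number of links) is one of the measures
# visualisation tools commonly use when sizing nodes.
degrees = {node: len(neighbours) for node, neighbours in graph.items()}
print(degrees)
```

Every tool below consumes some variant of this structure, whether loaded from a file, a database, or an API.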

Here is the list of top graph visualisation tools that Data to Value found useful.




Gephi

Gephi is an interactive visualization and exploration solution that supports dynamic and hierarchical graphs. Gephi's powerful OpenGL engine allows for real-time visualisation that supports networks up to 50,000 nodes and 1,000,000 edges. It also provides the user with cutting-edge layout algorithms, including force-based and multi-level algorithms, to make the experience that much more efficient. Moreover, the software is completely free and open source.



Tom Sawyer Perspectives


Tom Sawyer Perspectives is a graphics based tool for building data relationship visualization and analysis applications. This software supports two graphic modules: designer and previewer. The designer helps users to define schemas, data sources, rules and searches. The previewer can be used to iteratively view the application design without the need to recompile. Using these two tools together can drastically increase the speed of application development. Other features of Tom Sawyer Perspectives include: data integration for structured semi-structured and unstructured data, multiple view support and advanced graph analytic capabilities.




KeyLines

KeyLines is a JavaScript toolkit for building custom network visualisations quickly and easily. KeyLines puts more freedom into the user’s hands: because it is a toolkit rather than a pre-built application, developers can change nodes, links and menus, and add entire functions, with a few lines of code. It also includes geospatial integration, time bars, various layout patterns and filtering options.








Linkurious

Linkurious is an intuitive graph visualisation solution that works out of the box with no configuration needed, which allows people who are not particularly tech-savvy to start analysing and discovering insights in complex data. It includes a robust search engine that helps users find text in edges and nodes, as well as advanced analytic capabilities that let users combine filters to answer complex questions. Linkurious can also identify complex patterns using Cypher, a query language designed specifically for graphs.





GraphX

GraphX is an advanced graph visualisation library; it is an open-source project and part of the Apache Spark engine. Because it is open-source, there is plenty of room for customisation, from special functions to custom animations. It also utilises Spark’s computing technology to capture and store visual and data graphs in-memory. It has built-in support for layout algorithms, advanced graph edges and vertex features. GraphX also includes a visual preview function for all controls, as well as rich usability documentation and user support.




About us


Data to Value are a specialist data consultancy based in London. We apply graph technology to a variety of data requirements as part of next-generation data strategies. Contact us for more details if you are interested in finding out how we can help your organisation leverage this approach.

Originally posted on Data Science Central

Read more…

Guest blog post by Divya Parmar

To once again demonstrate the power of MySQL (download), MySQL Workbench (download), and Tableau Desktop (a free trial version can be downloaded here), I wanted to walk through another data analysis example. This time, I found a publicly available Medicare dataset and imported it using the Import Wizard, as seen below.



Let’s take a look at the data: it has hospital location information, measure names (payment for heart attack patients, pneumonia patients, etc.), and payment information.


I decided to look at the difference between the lower and higher payment estimates for heart attack patients in each state, to get a sense of the variance in treatment cost. I created a query and saved it as a view.
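The view itself is not reproduced here, but the aggregation behind it can be sketched in plain Python (the state names and payment figures below are made up; the real analysis ran as a MySQL view over the Medicare table):

```python
# Hypothetical rows: (state, lower payment estimate, higher payment estimate)
rows = [
    ("CA", 21500.0, 24300.0),
    ("CA", 20900.0, 23800.0),
    ("NY", 22100.0, 26000.0),
    ("NY", 21700.0, 25400.0),
]

# Group by state and average the (higher - lower) spread, mirroring
# a GROUP BY state / AVG(higher - lower) view in SQL.
totals = {}
for state, lower, higher in rows:
    total, count = totals.get(state, (0.0, 0))
    totals[state] = (total + (higher - lower), count + 1)

avg_diff = {state: total / count for state, (total, count) in totals.items()}
print(avg_diff)  # {'CA': 2850.0, 'NY': 3800.0}
```

One averaged spread per state is exactly the shape of data a choropleth map needs.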


One of the convenient features of Tableau Desktop is the ability to connect directly to MySQL, so I used that connection to load my view directly into Tableau.


I wanted to see how the difference between the lower and higher payment estimates varies by state. Using Tableau’s maps and its geographic recognition of the state column, a few drag-and-drop moves and a color fill completed the visualization.

You can copy the image itself to use elsewhere, choosing to add labels and legends if necessary. Enjoy. 

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here. He can also be found on LinkedIn and Twitter.

Read more…

Guest blog post by Ujjwal Karn

I created an R package for exploratory data analysis. You can read about it and install it here.  

The package contains several tools for performing initial exploratory analysis on any input dataset. It includes custom functions for plotting the data, as well as for different kinds of analyses such as univariate, bivariate and multivariate investigation, which is the first step of any predictive modeling pipeline. The package can be used to get a good sense of a dataset before building predictive models.

The package is constantly under development and more functionalities will be added soon. Pull requests to add more functions are welcome!

The functions currently included in the package are mentioned below:

  • numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
  • charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
  • Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
  • removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
  • bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata
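The package itself is written in R and its internals are not shown here; purely as an illustration of the kind of thing a numSummary-style helper computes, here is a rough Python analogue over a list-of-dicts "dataframe" (invented data, not the package's actual implementation):

```python
def num_summary(rows):
    """Summarise every numeric column in a list-of-dicts 'dataframe'."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        # Skip non-numeric columns, as numSummary does for characters.
        if all(isinstance(v, (int, float)) for v in values):
            summary[col] = {
                "min": min(values),
                "mean": sum(values) / len(values),
                "max": max(values),
            }
    return summary

data = [
    {"height": 1.62, "city": "NYC"},
    {"height": 1.75, "city": "LA"},
    {"height": 1.81, "city": "SF"},
]
print(num_summary(data))  # only 'height' is summarised
```

The real R functions of course report richer statistics (quantiles, missing counts, and so on), but the detect-then-summarise pattern is the same.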

More functions to be added soon. Any feedback on improving this is welcome!

Read more…

Originally posted on Data Science Central

Contributed by Belinda Kanpetch, an Architecture graduate student at Columbia University. With a strong urban design sense, she is fascinated by urban installation art and seeks out elements that ameliorate urban space. To learn to gather and apply information systematically in her work, she took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from April 11th to July 1st, 2016. This post is based on her first class project (due in the 2nd week of the program).

Why Street Trees?

The New York City street tree can sometimes be taken for granted or go unnoticed. Located along paths of travel, they stand steady and patient, quietly going about their business of filtering pollutants out of our air, producing oxygen, providing shade during the warmer months, blocking winds during cold seasons, and relieving our sewer systems during heavy rainfall, all while beautifying our streets and neighborhoods. Some recent studies have found a link between the presence of street trees and lower stress levels in urban residents.

So what makes a street tree different from any other tree? Mainly its location. A street tree is defined as any tree that lives within the public right of way; not in a park or on private property. Although they reside in the public right of way (or within the jurisdiction of The Department of Transportation) they are the property of and cared for by the NYC Department of Parks and Recreation.

With the intent to understand the data and explore what the data was telling me I started with some very basic questions:

  • How many street trees are there in Manhattan?
  • How many different species are there?
  • What is the general condition of the street trees?
  • What is the distribution of species by community district?
  • Is there a connection between median income of a community district to the number of street trees?

The Dataset

The dataset used for this exploratory visualization was downloaded from the NYC Open Data Portal; it was collected as part of TreesCount! 2015, a street tree census maintained by the NYC Department of Parks and Recreation. The first census was in 1995, and one has been conducted every 10 years since by trained volunteers.

Some challenges with this dataset involved missing values in the form of unidentifiable species types. There were 2,285 observations with an unclassifiable species type and 487 observations with unclassifiable community districts; the geographic information (longitude and latitude) was stored as character strings that had to be split into separate variables; and species were given as 4-letter codes without any reference to genus, species, or cultivar, so I had to find another dataset to decipher them.
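That coordinate-splitting step looks roughly like this in Python (the "(lat, lon)" string format here is an assumption for illustration; the census file's exact format may differ):

```python
def split_coordinates(raw):
    """Split a '(latitude, longitude)' string into two floats."""
    lat_str, lon_str = raw.strip("() ").split(",")
    return float(lat_str), float(lon_str)

# A coordinate pair stored as one character string, as in the raw data.
lat, lon = split_coordinates("(40.729432, -73.987557)")
print(lat, lon)  # 40.729432 -73.987557
```

Once latitude and longitude live in separate numeric columns, they can be joined against community district boundaries or fed straight to a mapping library.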

Visualizing the data

A quick summary of the dataset revealed a total of 51,660 trees in Manhattan, with 91 identifiable species plus one ‘species’ category holding the missing values.

A bar plot of all 92 species gave an interesting snapshot of the range in the total number of trees per species. It was quite obvious that one species has a dominant presence. To get a better understanding of the counts and of which species were common, I broke them down by quartiles and plotted them.

Plotting the first quartile (< 3.75) revealed that there were several species of which only a single tree exists in Manhattan!

The distribution within the 4th quartile (181.75 < total < 11,529) was informative in that it helped visualize the dominance of two specific species, the Honeylocust and the Ornamental Pear, which make up 23% and 15% of all the trees in Manhattan respectively. Coming in close were Ginkgo trees with 9.47% and the London Plane with 7.8%. This quartile also contained the missing-species group ‘0’.
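The quartile cut-points used above come from a standard quantile computation; here is a small Python sketch using linear interpolation, as R's default quantile (type 7) does, with toy per-species counts rather than the actual census numbers:

```python
def quartiles(values):
    """Return Q1, Q2, Q3 via linear interpolation (R's default, type 7)."""
    xs = sorted(values)
    n = len(xs)

    def q(p):
        pos = p * (n - 1)          # fractional rank of the quantile
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        return xs[lo] + frac * (xs[hi] - xs[lo])

    return q(0.25), q(0.5), q(0.75)

# Toy per-species tree counts
counts = [1, 1, 2, 3, 5, 40, 180, 900, 4800, 11529]
q1, q2, q3 = quartiles(counts)
print(q1, q2, q3)  # 2.25 22.5 720.0
```

Fractional cut-points such as 3.75 or 181.75 in the real data arise naturally from this interpolation between ranked counts.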

A palette of the top 4 species in Manhattan.

Looking at trees by Community District

I wanted to look at community districts rather than zip codes because, in my opinion, community districts are more representative of community cohesiveness and character. So I plotted the distribution by community district and tree condition.

Plotting the species distribution by community board using a facet grid helped visualize other species that did not show up as dominant in the previous graphs. It would be interesting to look further into what those species are and why they are more dominant within some community districts and not others.

Attempts at mapping

The ultimate goal was to map each individual tree location on a map of Manhattan with the community districts outlined or shaded in. I attempted to plot the trees using leaflet (bringing in shapefiles and converting them to a data frame) and ggplot, but neither yielded anything useful. The only visualization I was able to get used qplot, and it took over 2 hours to render.

Read more…

Originally posted on Data Science Central

By: Rawi Nanakul and Marnie Morales 

Rawi will present these ideas during a live webinar on May 24th at 9 AM PT / 12 PM ET. Get your questions answered in real-time during this one hour event. Register here 

We create, interpret, and experience stories every day, whether we realize it or not. Our brains are constantly receiving input and stringing things together in order for us to make sense of the world. While our brains create countless stories, only the few great ones stay with us. These make us cry, laugh, or embrace a new perspective.

Understanding how our brains interpret the world can help us become better storytellers. That’s where neuroscience comes in. The field of neuroscience covers anything that studies the nervous system, from studies on molecules within nerve endings to data processing, to even complex social behaviors like economics.

Take the Reader from the Known to the Unknown

So let’s put our brains to the test. Take a look at this image for a few seconds. What do you see?

We know very little about this scene. But because our brains crave structure, we still try to see the story. We take things we know—boxing gloves, children, and a corner man—and try to infer what the unknown might be.

A good story takes us from the Known to the Unknown. This simple premise is the key to telling stories for the brain. Let’s apply this concept to a comic. Why a comic? Comics are similar to data stories in that they present a sequence of panels containing different data points that lead you through a story.

Credit: xkcd


Known:

Election year is coming up.
The common joke of “if X wins, then I am leaving the country.”

Unknown (Punchline):

Dying in Canada = real.
Canada is the matrix.

What did we do in the course of reading the comic? We’re going to look at some basic brain anatomy to understand what our brain does when reading something like this.

Good Stories Activate More Parts of Our Brain

As you look at the comic, the prefrontal cortex in your frontal lobe kicks into gear, and your brain’s cognitive control goes to work. You're also processing data that comes into your brain as visual input. From your eyes, that data is sent to the primary visual cortex at the back of your brain and onward along two processing streams: the "what" and the "where" pathways.

The "what" pathway (in purple) uses detailed visual information to identify what we see. It pieces together the lines and figures that add up to the comic's characters. It also recognizes the letters and words, and helps decipher their meaning with the help of additional cortical regions like Wernicke's Area, a part of our language system.

The "where" pathway (in green) processes where things are in space. We know this data stream is important and active during reading because adults with reading disabilities like dyslexia often have disrupted functioning of this pathway.

So when we're interpreting visual information, we're activating quite a bit of our brains to make sense of the data we're presented.

Things get more complex from there, because as we interpret the stories we see, even more brain areas become active. Part of the way we comprehend stories is through a simulation of what we see. So you can potentially activate parts of your brain involved in motor control or your sense of touch.

And imagine if you connect emotionally to the story you're reading. You'll be activating areas of your brain involved in emotion (the limbic system). So when reading a good story, whether it's prose, a comic strip, a data-driven story, you have the potential to get almost global activation of your brain. And the most impactful and memorable stories are those that engage us most.

Channel Your Inner Oddball

Now that we know some of the anatomy, let’s look at the behavioral applications of what we know. Take a look at the figures and read them from left to right. Which one is not like the others? We can quickly see which figure is out of place. Our eyes jump right to it.

How did we know which one was the oddball figure without anyone telling us what it looked like? We had already established a baseline that our initial figure was the normal figure. And when the outlier was presented, we knew right away that it didn't belong.

This experiment is a common attentional process test called the oddball paradigm. A baseline is presented through repetition, then an oddball is presented. This should remind you of our Known-to-Unknown formula that I mentioned earlier. By creating a strong baseline, when the oddball—or an unexpected twist or climax—occurs, we are prepared for it and enjoy it.

Our brain is processing the information based on our experience of the information input. Below is a figure of an ERP, or event-related potential. ERPs are averaged waveforms that measure electrical activity from your scalp. We can use them to measure reaction speed to attentional processing.

Olichney, Nanakul, et al. 2012

In the left figure, we see our brains when presented with standard stimuli (each tick mark is 100ms). You see that we have relatively flat lines after the initial peak. The flat lines are expected because standard stimuli are essentially noise, and our mind zones out because it has been normalized.

The figure on the right shows the oddball—or target—tone, with a peak around 300 ms (also known as a P300). This peak comes from our brain detecting the oddball and concluding that this is the item to pay attention to. The peak is only possible once a clear baseline has been established.

What This Means for Storytelling

The example above shows us we have to lay down a good foundation and logical progression to get to our peak. Without structure, our audience will experience our story as noise and tune out, like our figure on the left.

When creating your own stories, remember that the brain craves structure and loves oddballs. The brain processes information by taking information it already knows to infer what a new piece of information might be. Therefore, making the story as easy as possible for the brain to understand is key to delivering a successful climax or twist.

Now that you have some basic understanding of brain anatomy and neuroscience, try applying the lessons learned to your data stories. Create dashboards that engage the senses through pleasing designs, shapes, color, text, and interactivity. Embrace the oddball paradigm by clearly establishing a baseline before delivering your findings. That way, the audience’s mind will be primed to attend to it. And their brains will help them remember your story as one of the few good ones.

Learn More about Storytelling with Data

Rawi Nanakul will present these ideas during a live webinar on May 24th. Register here 

Read more…

The bagged trees algorithm is a commonly used classification method. By resampling our data and building a tree for each resampled set, we can aggregate the trees' votes into a classification prediction. In this blog post I will demonstrate how bagged trees work, visualizing each step.

Visualizing Bagged Trees as Approximating Borders, Part 1

Visualizing Bagged Trees, Part 2

Conclusion: Other tree aggregation methods differ in how they grow trees, and some compute a weighted average. But in the end we can visualize the result of the algorithm as borders between classified sets in the shape of connected perpendicular segments, as in this 2-dimensional case. In higher dimensions these become multidimensional rectangular pieces of hyperplanes that are perpendicular to each other.
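As a concrete, much-simplified sketch of the resample-and-vote idea, here is from-scratch bagging of one-dimensional decision stumps in Python (toy data invented for illustration; real bagged trees grow full decision trees over many features):

```python
import random

def train_stump(points, labels):
    """Fit a 1-D decision stump: threshold + direction minimising errors."""
    xs = sorted(set(points))
    best = (len(points) + 1, xs[0], 1)  # (errors, threshold, sign)
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2     # candidate split between neighbours
        for sign in (1, -1):
            preds = [1 if sign * (x - t) > 0 else 0 for x in points]
            err = sum(p != y for p, y in zip(preds, labels))
            if err < best[0]:
                best = (err, t, sign)
    return best[1], best[2]

def predict_stump(stump, x):
    t, sign = stump
    return 1 if sign * (x - t) > 0 else 0

def bagged_predict(stumps, x):
    """Aggregate the stumps' votes; majority wins."""
    votes = sum(predict_stump(s, x) for s in stumps)
    return 1 if votes > len(stumps) / 2 else 0

random.seed(0)
X = [1, 2, 3, 4, 10, 11, 12, 13]   # two well-separated clusters
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Bagging: train one stump per bootstrap resample of the data.
stumps = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    stumps.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

print(bagged_predict(stumps, 2), bagged_predict(stumps, 12))  # 0 1
```

Each stump draws one axis-perpendicular border; the majority vote over all of them produces exactly the stitched-together perpendicular borders described above.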

Read more…

Contributed by Bin Lin. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Jan 11th to Apr 1st, 2016. The post was based on his second class project (due in the 4th week of the program).


Consumption patterns are an important driver of the development of the industrialized world, and consumer price changes reflect the economic performance and income of households in a country. In this project, the focus is on food price changes. The goals of the project were to:

  • Utilize Shiny for interactive visualization (the Shiny app is hosted online).
  • Explore food price changes over time, from 1974 to 2015.
  • Compare food price changes to All-Items price changes (All-items include all consumer goods and services, including food).
  • Compare Consumer Food Price Changes vs. Producer Price Changes (producer price changes are the average change in prices paid to domestic producers for their output).




Consumer Food Price Changes Dataset:

  • Data dimension: 42 rows x 21 columns
  • Missing data: There are 2 missing values in the column of "Eggs"

Producer Price Changes Dataset:

  • Data dimension: 42 rows x 17 columns
  • Missing data: There are 25 missing values in the column of "Processed.fruits.vegetables".

Consumer Food Categories:

  • Data dimension: 20 rows x 2 columns

Data Analysis and Visualization:

Food Consumption Categories:

Food consumption is broken out into 20 categories. Among all of them, the categories with the highest shares of consumer expenditures are (see Figure 1 and Figure 2):

  • Food.away.from.Home (eating out): 40.9%
  • Other.foods: 10.5% (the sum of the remaining uncategorized foods)
  • Cereals.and.bakery.products: 8.0%
  • Nonalcoholic.beverages: 6.7%
  • Dairy.products: 6.3%
  • Beef.and.veal: 4.1%
  • Fresh.fruits: 4.0%

The high share of nonalcoholic beverages/soft drinks (6.7%) seems concerning, as high consumption of soft drinks may pose health risks.

Figure 1: Pie Chart on Food Categories Share of Consumer Expenditures

Figure 2: Bar Chart on Food Categories Share of Consumer Expenditures 

Food Price Changes over Time:

The Consumer Price Index (CPI) is a measure that examines the average change over time in the prices paid by consumers for goods and services. It is calculated by taking the price change for each item in a predetermined basket of goods and averaging them, with the goods weighted according to their importance. Changes in the CPI are used to assess price changes associated with the cost of living.
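The basket-weighting idea can be illustrated numerically; the category weights and price changes below are made up for the sketch, not taken from the BLS data:

```python
# Hypothetical basket: category -> (expenditure weight, yearly price change %)
basket = {
    "Cereals and bakery": (0.08, 1.2),
    "Dairy products": (0.06, -0.5),
    "Nonalcoholic beverages": (0.07, 0.9),
    "Food away from home": (0.41, 2.6),
}

total_weight = sum(w for w, _ in basket.values())
# Weighted average price change: each category scaled by its weight.
index_change = sum(w * change for w, change in basket.values()) / total_weight

print(round(index_change, 3))
```

Because "food away from home" carries by far the largest weight, its 2.6% rise dominates the index even though one category's prices actually fell.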

In the Shiny app, I created a line chart that shows price changes for different food categories, selected from a drop-down list. See Figure 3 for a screenshot of the price changes of different food categories over time.

As I was looking at the food price changes, I noticed that there was a dramatic increase during the late 70s. After reviewing the history of the 1970s, I learned that a lot happened during that period, including the "Great Inflation".

Figure 3: Screenshot of Different Food Categories Price Changes over Time

Yearly Food Category Price Changes:

To view the food price changes for each category in a given year, I created a bar chart in the Shiny app. Users can select a year from a slider, and the chart shows each category's price change for that year. I actually created two bar charts side-by-side so that users can compare food price changes between any two years.

A quick look at 2015 shows that the price of "Eggs" increased the most, while the price of "Pork" dropped the most; in fact, many food categories dropped in price. Compared to 2015, 2014 had fewer categories with price drops, and the price of "Beef and Veal" increased the most.

Figure 4: Screenshot of Food Category Price Changes by Year

Food Price Changes vs All-Items Price Changes

The Consumer Price Index (CPI) for food is a component of the all-items CPI, which led me to compare the two. From the line chart, I observed:

  • Food price changes mostly align with all-items price changes.
  • Food price inflation has outpaced economy-wide inflation in recent years.

Figure 5: Price Changes in All-Items vs Price Changes in Food

Food Price Changes vs Producer Price Changes

According to the United States Department of Agriculture (USDA), changes in farm-level and wholesale-level PPIs are of particular interest in forecasting food CPIs. Therefore, I created a chart showing overall food price changes vs. producer price changes. Users can choose one or more producer food categories.

The chart shows that food price changes mostly align with producer price changes. However, farm-level milk, cattle, and wheat prices have fluctuated since 2000 without affecting the overall food price change very much. Though their impact on the overall food price was small, I suspect they may have impacted individual food categories. I would like to add a drop-down list that lets users select from the consumer food categories.

Figure 6: Food Price Changes vs Producer Price Changes

Correlation Tile Map:

To see the relationships among the different categories in terms of price changes, I created a correlation tile map.
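A tile map of this kind is built from pairwise Pearson correlations between the category series; the underlying computation can be sketched in a few lines of Python (toy yearly series, not the real data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy yearly price changes for two food categories
beef = [3.1, 4.0, 2.2, 5.3, 1.8]
pork = [2.9, 3.8, 2.5, 4.9, 2.0]
print(round(pearson(beef, pork), 3))  # close to 1: they move together
```

Computing this coefficient for every pair of categories fills in the grid of tiles, with each tile colored by its correlation value.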


Food prices have been increasing, by varying percentages. Since 1990, food price changes have stayed within a small range. The degree of food price inflation varies depending on the type of food.

Looking ahead to 2016, ERS predicts food-at-home (supermarket) prices to rise 2.0 to 3.0 percent, a rate of inflation that remains in line with the 20-year historical average of 2.5 percent. For future work, I would like to fit a time-series model to predict price changes for the coming five years.

Again, this project was done in Shiny, and most of the information in this blog post comes from the Shiny app.

Originally posted on Data Science Central

Read more…
