
‘Key Performance Indicators’, or KPIs, matter to every enterprise, and nearly every company is talking about them these days. But many businesses still don’t know how to define the right KPIs to get an accurate picture of success.

To really understand where you are succeeding and where you are falling short, you have to measure the right things. For example, if your goal is to increase sales in the Minneapolis store by 5% in the year 2015, you couldn’t determine success by establishing a KPI to measure the number of shopping bags you have on hand in the store. Do we care about the number of sales people on staff at a certain time of day, and whether that affects our sales? Do we want to look at the store hours for a particular day of the week to determine whether extended hours in a certain season or on a certain day may result in more sales? Should we look at the impact of sales rep training on closed sales?

Likewise, if you want to establish metrics to evaluate the effectiveness of your internet marketing program, you will probably have to look at the program from several perspectives. That is true of nearly every initiative in your company, and it is where many businesses go awry: they assume they can establish one metric per goal when, in fact, your business is more complex than that, and success at a goal usually depends on more than one factor.

Let’s consider the KPIs for an internet marketing program. We can’t just say that we want to increase sales; we have to decide how we will determine success. Will we include site visits, visits per page, the click-to-conversion ratio, the number of email and newsletter ‘unsubscribe’ requests, the click-through rates for visitors arriving from social media sites, and so on? These factors might tell us which internet marketing techniques are driving traffic to our site, but do they tell us whether that traffic is coming from our target audience, or what percentage of the traffic from each source actually results in a purchase? Do they measure the time of day, the day of the week, or the season in which these sales conversions are most likely?
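Purely as an illustration (not part of the original article), here is a minimal R sketch of how a few of the web KPIs mentioned above could be computed from a hypothetical analytics extract; every column name and figure below is invented for the example.

# Hypothetical daily analytics extract; all figures are made up for illustration
web = data.frame(
  source       = c("organic", "social", "email"),
  visits       = c(5200, 1800, 950),
  clicks       = c(640, 210, 180),
  conversions  = c(58, 12, 21),
  unsubscribes = c(0, 0, 14)
)

# Click-through rate, click-to-conversion ratio and unsubscribe rate per source
web$ctr              = web$clicks / web$visits
web$conversion_rate  = web$conversions / web$clicks
web$unsubscribe_rate = web$unsubscribes / web$visits
web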

Of course, every business, industry, location and team is different and you have to look carefully at your own business to determine what is relevant. The most important thing to ask yourself when you establish KPIs is, ‘how does this measurement correlate to our success?’ If I measure this particular thing, does the resulting number or data point give me any insight into how well we are doing, how much money we are making, and whether this task, activity or goal is actually having an effect on the overall performance of the business?

There is one final point to consider when establishing Key Performance Indicators (KPIs) and an integrated business intelligence approach to decision-making: enterprise culture and communication matter. There are industry-standard and function-specific business intelligence tools with KPI modules, but these solutions still have to be tailored to the individual organization and its targets, with minimums and maximums defined and then gradually handed over to the teams for adoption. To get a true picture of KPIs and business intelligence, the enterprise must integrate data from disparate sources and systems, and that takes careful planning and implementation.

Throughout this process, the business must be committed to building a performance-driven culture and to streamlining and improving communication, and, in all likelihood, getting to the desired state will be an iterative process. It may seem like the enterprise is taking the long way around, but the business team must focus on building for the long term and on achieving solid results and a culture that supports clear, concise, objective decision-making and full commitment to business success at every level.

If a business is committed to performance-driven management, it must link its goals to its processes and create key performance indicators that objectively measure performance and keep the company on track. Whether your goal is to create a successful eCommerce site, increase customer satisfaction by 15% or reduce expenses, you must have a good understanding of what you mean by the word ‘success’.

Read more…

Successful sales force management depends on up-to-date, accurate information. With appropriate, easy access to business intelligence, a Sales Director and Sales Managers can monitor goals and objectives. But that’s not all a business intelligence tool can do for a sales team.

In today’s competitive market, marketing, advertising and sales teams cannot afford to wait and be outstripped by the competition. They must begin to court and engage a customer before the customer has the need for an item. By building brand awareness and improving product and service visibility, the sales team can work seamlessly across the marketing and sales channel to educate and enlighten prospects, and then carry them through the process to close the deal.

To do that, the sales staff must have a comprehensive understanding of buying behaviors, current issues with existing products, pricing points and the impact of changing prices, products or distribution channels. With access to data integrated from CRM, ERP, warehousing, supply chain management, and other functions and data sources, a sales manager and sales team can create personalized business intelligence dashboards to guide them through the process and to help them analyze and understand trends and patterns before the competition strikes.

The enterprise must monitor sales results at the international, national, regional, local, team and individual sales professional level. As a sales manager, you should be able to manage incentives and set targets with complete confidence, and provide accurate sales forecasts and predictions so that the enterprise consistently meets its goals and can depend on the predicted revenue and profits for investment, new product development, market expansion and resource acquisition.
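As a rough sketch of what such multi-level monitoring can look like in practice, a grouped roll-up in R might proceed as follows; the data frame, rep names and numbers below are invented for the example and are not part of the original post.

library(dplyr)

# Hypothetical sales records; every name and figure is made up for illustration
sales = data.frame(
  region = c("North", "North", "South", "South"),
  rep    = c("Ann", "Bob", "Cara", "Dev"),
  target = c(100, 120, 90, 110),
  actual = c(95, 130, 80, 115)
)

# Roll individual results up to the regional level and compare with targets
sales %>%
  group_by(region) %>%
  summarise(target = sum(target), actual = sum(actual)) %>%
  mutate(attainment = actual / target)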

Business Intelligence for the sales function must include Key Performance Indicators (KPI) to help the team manage each role and be accountable for objectives and goals. If a sales region fails to meet the established plan, the business can quickly ascertain the root cause of the issue, whether it is product dissatisfaction, poor sales performance, or any one of a number of other sources.

Since the demand generated by the sales force directly affects the production cycle and plan, the sales team must reconcile sales targets and objectives with product capacity and production to ensure that it can satisfy customers without shortfalls or back orders. If some customers are behind on product payments, the business must be able to identify the source of the issue and address it before it results in decreased revenue and results.

The ten benefits listed below comprise a set of ‘must haves’ for every sales team considering a business intelligence solution:

  1. Set targets and allocate resources based on authentic data, rather than speculation
  2. Establish, monitor and adapt accurate forecasts and budgets based on up-to-date, verified data and objective KPIs
  3. Analyze current data, and possible cross-sell and up-sell revenue paths and the estimated lifetime value of a customer
  4. Analyze the elements of sales efforts (prospecting, up-selling, discounts, channel partners, sales collaterals, presentations) and adapt processes that do not provide a competitive edge and strong customer relationships and client loyalty
  5. Measure the factors affecting sales effectiveness to improve sales productivity and correct strategies that do not work
  6. Achieve a consistent view of sales force performance, with a clear picture of unexpected variations in sales and immediate corrective action and strategic adjustment based on trends and patterns
  7. Understand product profitability and customer behavior, by spotlighting customers and products with the highest contribution to the bottom line
  8. Revise expense and resource allocation using the net value of each customer segment or product group
  9. Identify the most effective sales tactics and mechanisms, and the best resources and tools, to meet organizational sales objectives
  10. Establish a personalized, automated alert system to identify and monitor upcoming opportunities and threats

When the enterprise provides a single source, integrated view of enterprise data from numerous sources and enables every user to build views, dashboards and KPIs, every member of the sales team is engaged in the pursuit of strategic, operational and tactical goals. In this way, the enterprise can acquire new clients, retain existing clients, and sell new products and services without a misstep.

Read more…

Guest blog post by SupStat

Contributed by Sharan Duggal.  You can find the original article here.

Introduction


We know that war and civil unrest account for a significant proportion of deaths every year, but how much can mortality rates be attributed to a simple lack of basic resources and amenities, and what relationship do mortality rates have with such factors? That’s what I set out to uncover using WorldBank data that covers the globe for up to the last 50 odd years, and I found a strong relationship with some of the available data.

If you were to look at overall mortality rates, the numbers would be muddied by several factors, including the aforementioned causes of death, so I decided to look at two related, but more specific outcome variables – infant mortality as well as risk of maternal death.

Infant mortality is defined as the number of infants dying before reaching one year of age, per 1,000 live births in a given year.

Lifetime risk of maternal death is the probability that a 15-year-old female will die eventually from a maternal cause assuming that current levels of fertility and mortality (including maternal mortality) do not change in the future, taking into account competing causes of death.

While I am sure these numbers can also be affected by factors like civil unrest, they focus on individuals who are arguably more likely to be affected by communicable diseases and by a lack of basic provisions like clean water, electricity or adequate medical resources, among others.

So, what do overall mortality rates even look like?

The density plot below includes the overall infant mortality distribution along with some metrics indicating the availability of key resources. Infant mortality rates peak at around 1%, and the availability of resources peaks closer to 100%. In both cases we see very long tails, indicating that a portion of countries experiences far less than ideal numbers.
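The original chart is not reproduced here, but a density plot of this kind can be sketched in R roughly as follows; the simulated data merely stands in for the WorldBank indicators and is not the author's actual data.

library(ggplot2)

# Simulated country-level values standing in for the real indicators
set.seed(1)
wb = data.frame(
  infant_mortality   = rbeta(180, 2, 60) * 100,  # peaks near 1%, long right tail
  electricity_access = rbeta(180, 8, 1) * 100    # peaks near 100%, long left tail
)

ggplot(wb) +
  geom_density(aes(x = infant_mortality), fill = "firebrick", alpha = 0.4) +
  geom_density(aes(x = electricity_access), fill = "steelblue", alpha = 0.4) +
  labs(x = "Percent", y = "Density")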

So, to drill down further, let’s have a closer look at the distribution of both outcome variables by year. The boxplots below suggest that both infant mortality rates and the risk of maternal death have shown not only steady overall improvement over the years but also a reduction in the disparity across country-specific observations. But the upper ends of these distributions still represent shocking numbers for some countries: over 10% of infants dying every year (down from a high of 24% in 1961), and a 7.5% probability that a 15-year-old girl living today will eventually die of a maternal cause (down from over 15% twenty-five years ago).

Please note: points have been marginally jittered above for clearer visual representation

Mortality Rates across the Globe


The below map plots the 2012 distribution of infant mortality rates by country. I chose 2012 because most of the covariates I would eventually like to use contain the best information from this year, with a couple of exceptions. It also presents a relatively recent picture of the variables of interest.

As can be seen, the world is distinctly divided, with many African, and some South Asian, countries bearing a much bigger burden of infant mortality. And if it wasn’t noticeable on the previous boxplot, the range of values shown in the scale below is particularly telling of the overall disparity in mortality rates, pointing to a severe imbalance across the world.

The map representing the risk of maternal death is almost identical, and as such has been represented in a different color for differentiation. Here, the values range from close to 0% to over 7%.

Bottom Ranked Countries Over the Years


After factoring in all 50+ years of data for infant mortality and 26 years of data for risk of maternal death, and then ranking countries, the same set of countries feature at the bottom of the list.

The below chart looks at the number of times a country has had one of the worst three infant mortality rates in any given year since 1960.
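A "bottom three" count of this kind can be derived with dplyr roughly as sketched below; the long-format data frame and its column names are assumptions for illustration, not the original analysis code.

library(dplyr)

# Simulated long-format mortality data; countries and rates are invented
set.seed(42)
mort = expand.grid(country = paste("Country", 1:20), year = 1960:2015)
mort$infant_mortality = runif(nrow(mort), 0.5, 25)

bottom3_counts = mort %>%
  group_by(year) %>%
  mutate(rank_worst = rank(-infant_mortality, ties.method = "min")) %>%  # worst rate = rank 1
  filter(rank_worst <= 3) %>%
  ungroup() %>%
  count(country, sort = TRUE)   # years each country spent among the worst three

bottom3_counts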

The chart for maternal data goes from 1990 through to 2015. It’s important to note that Chad and Sierra Leone were ranked in the bottom 3 for maternal risk of death in every year since 1990.

Please note that numbers may be slightly impacted by missing data for some countries, especially for earlier years in the data set.

Relationship between Mortality & Resources


Getting back to the original question: are there any low-hanging fruit or easy fixes for such a dichotomous situation? While my efforts during this analysis did not include any regressions, I did want to get an initial understanding of whether the availability of basic resources has a strong association with mortality rates and, if such a relationship exists, which provisions are more strongly linked with these outcomes. The findings could serve as a platform for further research.

The below correlation analysis helped home in on some of the stronger linkages and helped weed out some of the weaker ones.

Note, the correlation analysis was run using 2012 data for all metrics, except for “Nurses and Midwives (per 1000 people)” and “Hospital beds (per 1000 people)” for which 2010 and 2009 data was used respectively, due to poorer availability of 2012 data for these measures.
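The correlation matrix itself can be computed and plotted along the following lines; this is only a sketch using simulated stand-in data and the corrplot package, not the author's original code or data.

library(corrplot)

# Simulated stand-in for the prepared 2012 country-level indicators
set.seed(7)
df2012 = data.frame(
  infant_mortality = runif(150, 1, 12),
  maternal_risk    = runif(150, 0, 7),
  electricity      = runif(150, 20, 100),
  sanitation       = runif(150, 10, 100),
  improved_water   = runif(150, 30, 100)
)

M = cor(df2012, use = "pairwise.complete.obs")
corrplot(M, method = "circle", type = "lower", tl.cex = 0.8)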

 

Focusing on the first two columns of the above correlation plot, which represent risk of maternal death and infant mortality, we see a very similar pattern across the variables included in the analysis. Besides basic resources, I had also included items like the availability of renewable freshwater resources and land area, to see if naturally available resources had any linkage to the outcomes in question. They didn’t, and so they were removed from the analysis. The plot also shows that average rainfall and population density don’t have much of a relationship with the mortality rates in question. What was also surprising was that access to anti-retroviral therapy had only a weak correlation with mortality rates in general.

The metrics that had the strongest relationship (in the 0.75 to 0.85 range) were:

  • Percent of population with electricity
  • Percent of population with access to non-solid fuel
  • Percent of population with access to improved sanitation facilities, and
  • Percent of population with access to improved water sources


The first two require no definitional explanation. Access to improved sanitation facilities ensures the hygienic separation of human excreta from human contact. Access to improved water sources refers to the percentage of the population using an improved drinking water source, including piped water on premises and other improved drinking water sources (public taps or standpipes, tube wells or boreholes, protected dug wells, protected springs, and rainwater collection).

Analyzing the strongly correlating factors by Region


The following four charts look at regional performance on the key identified metrics. The pattern is the same as that seen on the static 2012 world map, but this also gives us a view of how the resources that seem to be strongly linked with infant and maternal mortality have been trending over the past 25 years. We see a fairly shallow slope for Sub-Saharan Africa on access to non-solid fuel as well as on improved sanitation facilities. Improvements in access to drinking water have been much better.

South Asian countries ranked lowest on the provision of sanitation facilities in the early ’90s, but have made improvements since.

Conclusion


My analysis found a very strong relationship between mortality rates and basic provisions. It also weeded out some factors which were less important. As a next step, it may be helpful to do a deeper country-specific analysis for African and South Asian nations that suffer from a chronic lack of basic infrastructure, to see where investments would be most fruitful in bringing these countries to a closer state of parity with the developed world.

Read more…

Originally posted on Data Science Central

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding”

So said Hal Varian, Chief Economist at Google and emeritus professor at the University of California, Berkeley, on August 5, 2009.

Today, what Hal Varian said almost seven years ago has been confirmed, as highlighted in the following graph taken from Google Trends, which gives a good idea of the current attention paid to the figure of the Data Scientist.

The Observatory for Big Data Analytics & BI of Politecnico di Milano has been working on the theme of Data Scientists for a few years, and has now prepared a survey for Data Scientists that will be used to build a picture of the Data Scientist: their role within the company and the context in which they operate.

If you work with data in your company, please support us in our research and take this totally anonymous survey here. Thank you from the Observatory for Big Data Analytics & BI.

 

Graph 1: How many times the term "Data Scientist" has been searched on Google. The numbers in the graph show search interest relative to the highest point in the graph: the point with the maximum number of searches is given the value 100, and the other values are proportional.

Mike Loukides, VP of O’Reilly Media, summarized the Data Scientist’s job description in these words:

"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."

We are in the era of Big Data, an era in which 2.5 quintillion (10^18) bytes are generated every day. Both the private and the public sector everywhere are adapting so that they can exploit the potential of Big Data by bringing into their organizations people who are able to extract information from data.

Getting information out of data is increasingly important because of the huge amount of data available. As Daniel Keys Moran, programmer and science fiction writer, said: “You can have data without information, but you cannot have information without data”.

In companies today, we are seeing positions like the CDO (Chief Data Officer) and Data Scientists more often than we used to.

The CDO is a business leader, typically a member of the organization’s executive management team, who defines and executes analytics strategies. This is the person actually responsible for defining and developing the strategies that will direct the company’s processes of data acquisition, data management, data analysis and data governance. This means that new governance roles and new professional figures have been introduced in many organizations to exploit what Big Data offer them in terms of opportunities.

According to the report on “Big Success with Big Data” (Accenture, 2014), 89% of companies believe that, without a big data analytics strategy, in 2015 they risk losing market share and will no longer be competitive.

Collecting data is not simply retrieving information: the Data Scientists’ role is to translate data into information, and currently there is a dearth of people with this set of skills.

It may seem controversial, but both companies and Data Scientists know very little about which skills are needed. They are operating in a turbulent environment where frequent monitoring is needed to know who actually uses which tools, which tools are considered old and becoming obsolete, and which are used by the highest and lowest earners. According to a study by RJMetrics (2015), the Top 20 Skills of a Data Scientist are those shown in the following graph.

The graph clearly shows the importance of tools and programming languages such as R and Python; Machine Learning, Data Mining and Statistics are also high up in the set of most requested skills. Those relating to Big Data are at about the 15th place.

The most recent research on Data Scientists shows that these professionals are more likely to be found in companies belonging to the ICT sector (internet companies and software vendors such as Microsoft and IBM) rather than in social networks (Facebook, LinkedIn, Twitter), Airbnb, Netflix, etc. The following graph, provided (like the previous one) by RJMetrics, gives the proportion of Data Scientists by industry.

It is important to keep monitoring Data Scientists across industrial sectors, their diffusion and their main features, because, in the unsettled business world of today, we can certainly expect a great many changes to take place as companies become aware, at different times and in different ways, of the importance of Data Scientists.

Read more…

Shopper Marketing -Infographic

Originally posted on Data Science Central

This infographic on Shopper Marketing was created by Steve Hashman and his team. Steve is Director at Exponential Solutions (The CUBE) Marketing. 

Shopper marketing focuses on the customer in and at the point of purchase. It is an integrated and strategic approach to a customer’s in-store experience which is seen as a driver of both sales and brand equity.

For more information, click here


Read more…

Guest blog post by Vimal Natarajan

Introduction


The City and County of San Francisco launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and more. Under the category of Public Safety, the portal contains the list of SFPD incidents since January 1, 2003.
In this post I have done an exploratory time-series analysis of the crime incidents dataset to see if there are any patterns.

Data

The data for this analysis was downloaded from the publicly available dataset on the City and County of San Francisco’s open data website, SF OpenData. The crime incidents dataset has data recorded from 2003 to date. I downloaded the full dataset and performed my analysis for the period from 2003 to 2015, filtering out the data from 2016. There are nearly 1.9 million crime incidents in this dataset.
I performed minimal processing on the downloaded raw data to facilitate my analysis.
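The post does not include the preparation code, but that step can be sketched roughly as below; the file name and the Date column format are assumptions based on the public SFPD export, not taken from the original analysis.

library(dplyr)

# Hypothetical local export of the SFPD incidents data; file and column names are assumptions
incidents = read.csv("SFPD_Incidents_2003_to_present.csv", stringsAsFactors = FALSE)

incidents = incidents %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         Year = as.integer(format(Date, "%Y"))) %>%
  filter(Year <= 2015)   # keep 2003-2015, dropping the partial 2016 data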

Analysis

Crimes by Year

The following plot depicts the crimes recorded from the year 2003 till the end of the year 2015.

The horizontal line represents the average number of crimes during those years, which is just below 150,000 per year. As you can observe, from 2003 to 2007 the number of crime incidents decreased steadily, but in 2008 and 2009 there was a slight increase. These two years are when the United States went through the financial and subprime mortgage crisis resulting in what is called the Great Recession. According to the US National Bureau of Economic Research, the recession began around January 2008 and ended around June 2009. As most statisticians say, “correlation does not imply causation”, and I too want to emphasize that without additional data, and insights from its analysis, it may not be possible to relate these two events; nevertheless, it is an interesting observation. Following that period, there was a slight decrease in crime incidents over the next two years, but the number has increased since 2012, ending up above average from 2013 to 2015.
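Continuing the hypothetical preparation sketch above, the yearly chart could be produced along these lines (again an illustration, not the author's code):

library(ggplot2)

by_year = incidents %>% count(Year)

ggplot(by_year, aes(x = Year, y = n)) +
  geom_col() +
  geom_hline(yintercept = mean(by_year$n), linetype = "dashed") +  # yearly average
  labs(y = "Recorded incidents", title = "Crimes by Year")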

Mean crimes by month

The following plot depicts the mean crimes for each month from January to December. You can observe that the mean crime count for each month is more or less around the monthly average, which is just below 12,000 (horizontal line). One interesting observation is that the mean is significantly below the monthly average for February, November and December. Possible reasons are that February has fewer days than the other months and that the festive holiday season falls in November and December.

Mean crimes by day of the month

The following plot depicts the mean crimes for the different days of the month. You can observe that the mean crime count for each day is pretty much around the daily average, which is just below 400 (horizontal line), for the days from the 2nd through the 28th. The mean for the first day of the month is significantly above average. One possible reason could be that the first day of the month is usually payday. Again, correlation does not imply causation; without additional related data and insights derived from its analysis we cannot be sure. The 29th and 30th are also below average, which may be because February has a 29th only in leap years and never a 30th. The mean for the 31st of the month is around half of the daily average, most likely because only seven of the twelve months have a 31st day.

Mean crimes by hour of the day

The following plot depicts the mean crimes by hour of the day. You can observe that this plot is very different from the other plots, in the sense that the crime incidents are far from the hourly average, which is around 16 (horizontal line). Within this plot you can observe some interesting patterns: crime incidents are well above average around midnight, then decline steadily to significantly below the hourly average until early morning around 5 AM. From 6 AM the crime incidents steadily increase and spike around noon. From noon they stay well above average, peaking around 6 PM in the evening and then declining.
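Continuing the same sketch, and assuming a Time column in "HH:MM" format (an assumption about the raw export), the hourly profile could be derived as follows:

by_hour = incidents %>%
  mutate(Hour = as.integer(substr(Time, 1, 2))) %>%
  count(Year, Hour) %>%
  group_by(Hour) %>%
  summarise(mean_per_day = mean(n) / 365)   # average incidents at this hour on a given day

ggplot(by_hour, aes(x = Hour, y = mean_per_day)) +
  geom_line() +
  geom_hline(yintercept = mean(by_hour$mean_per_day), linetype = "dashed") +
  labs(y = "Mean incidents", title = "Mean Crimes by Hour of the Day")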

Mean crimes by day of the week

The following plot depicts the mean crimes by day of the week. As you can observe, Sunday has the fewest crime incidents, well below the daily average, which is just below 400 (vertical line), and Friday has the most crime incidents, well above the daily average.

Mean crimes during holidays

The following plot depicts the mean crimes during a few key days, such as holidays in the United States. You can observe that the number of crime incidents is significantly high on New Year’s Day, well above the daily average, which is just below 400 (horizontal line). On the other holidays the number of crime incidents is more or less the same as the daily average, but on Christmas Eve and Christmas Day it is significantly lower. Since Thanksgiving falls on a different date each year, as an approximation I chose November 24. I was expecting to see significantly lower crime incidents during this period, but that does not seem to be the case.

Conclusion

In conclusion, based on the above observation, we can see some patterns in the crime incidents and arrive at the following conclusions:

  • The average number of crime incidents happening daily in the City and County of San Francisco is around 400.
  • The number of crime incidents is highest around midnight and lowest at the early morning hours.
  • The number of crime incidents is usually lower during Christmas.
  • The number of crime incidents has been slowly increasing in the recent years.
  • The number of crime incidents is high during New Year day and at the beginning of every month.

The above is just a high-level exploratory time-series analysis. With further in-depth analysis it is possible to arrive at more insights. In my future posts I will try to perform those analyses.

Technology

This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.
The data processing and plots were done using the R libraries ggplot2 and dplyr.

Read more…

Guest blog post by Dante Munnis

Digging through messy data and doing numerous calculations just so you can submit a report or work out your quarterly business development results can sometimes be nigh impossible. After all, we are only human, and by the time we get to the other side of our spreadsheet equation, we have lost all sense of what we were trying to accomplish.

Luckily, there are data visualization and analysis tools out there that can do most of the heavy lifting for us. Remember, you will still need to do some of the work yourself, but putting it all together will become that much simpler. Let’s take a look at some of the best data analysis tools at our disposal.

1. Open Refine

At first, you might be surprised by how much Open Refine resembles Google’s own spreadsheets. This is because it started as a Google project but has since become an independent, community-supported open source tool. In practice, this means that Open Refine has the built-in algorithms and formulas that you might need for your business data analysis.

Keep in mind that while it does resemble a spreadsheet, it doesn’t have all the regular features you would expect, such as manual cell manipulation and custom formulas; you would need to export your data and bring it back in. If that doesn’t cause too much of a headache, you might want to give Open Refine a shot.

2. Data Wrangler

Stanford University’s own data analysis tool is open to public use. While text manipulation and a web-based interface are certainly a plus, you should consider other factors as well. Some of the default formulas don’t work very well with large amounts of data, often producing incorrect results or simply crashing the tool. And while it is easy and accessible to use, Data Wrangler might not be a good fit for internal or sensitive data, since all of the data is stored at Stanford for research purposes.

3. Rapid Miner

As one of the best data visualization tools out there, Rapid Miner had to find its way onto our list. It can not only manipulate and analyze custom data but also model and visualize the results. This award-winning tool is known to provide great results no matter what data you are trying to analyze.

The near-perfect visualization system is just an added bonus considering everything that you are getting. If you need a tool that can help you lead and develop projects with coworkers that are less than adept at analysis, Rapid Miner is the perfect tool for the job.

4. Wolfram Alpha

Ever wonder what it would feel like to have a personal computing assistant? Wolfram Alpha is exactly such a platform. Think of Google Search, but for business analytics and data research. Whatever your field of work and specific needs, you can be sure that Wolfram will make sense of it and help you decode any problems you might be experiencing.

5. Solver

Sometimes you don’t need external apps or web services for your data analysis. Solver is one such add-in for your Excel spreadsheets. Offering a wide variety of optimization and programming algorithms, Solver will help you make sense of your data much faster than you otherwise would. It’s light, fast and easy to use, so there’s no reason not to give it a shot. Keep in mind that Solver won’t be able to handle more complex and demanding analysis tasks, so make sure you use it in a smart way.

6. Google Fusion Tables

While it may not be the most versatile or complex tool on the web, Fusion Tables is one of the most accessible data visualization services out there. The best thing about it is that it is free to use and very approachable, so there’s no need to spend hours on end learning what’s what. You can visualize your data in any shape or form you desire. Just keep in mind that this tool is for simple calculations, not vast, sprawling data analysis tasks.

7. Zoho Reports

You might have heard about Zoho, since it’s one of the most popular business data analysis tools on the web. It’s fairly easy to get into and use, requiring only a simple log-in and data input. Use Zoho to quickly and professionally turn your data into charts, tables and pivots in order to use them for further research.

8. NodeXL

Taking the best from both worlds, NodeXL is simple to use and fairly advanced in its algorithmic capabilities. You can not only analyze and visualize raw data, but also use it to build and visualize networks and the relations between different results. While some of the features might be too advanced for everyday data analysis, NodeXL is the perfect tool for more complex tasks.

9. Google Chart Tools

Another Google tool on our list, Google Chart Tools provides visualization and analysis but doesn’t focus on raw data. Instead, you can point the tool at different sources on the web and bring them together in visualized charts, analyzing external data to get the results you need. While it’s very useful and provides accurate output, Google Chart Tools isn’t very user friendly, requiring a bit of programming knowledge to fully utilize its capabilities.

10. Time Flow

Data analysis sometimes requires a different kind of visualization. Time Flow is a tool that can analyze and visualize time points and create a data map that provides a clear picture of how and when your specific data developed. While it does sound complex, the tool itself is fairly easy to use and allows a plethora of customization options. Use Time Flow whenever you need to create timelines and streamline your data.

About the author: Dante Munnis is a media and marketing expert currently working at Essay Republic. He shares ideas and experience on how to build your brand and attract more customers.

Read more…

Originally posted on Data Science Central

Things, not Strings
Entity-centric views on enterprise information and all kinds of data sources provide means to get a more meaningful picture about all sorts of business objects. This method of information processing is as relevant to customers, citizens, or patients as it is to knowledge workers like lawyers, doctors, or researchers. People actually do not search for documents, but rather for facts and other chunks of information to bundle them up to provide answers to concrete questions.

Strings, or names for things, are not the same as the things they refer to. Still, those two aspects of an entity regularly get mixed up, nurturing a Babylonian confusion of language. Any search term can refer to different things, which is why Google rolled out its own Knowledge Graph to help organize information on the web at large scale.

Semantic graphs can build the backbone of any information architecture, not only on the web. They can enable entity-centric views also on enterprise information and data. Such graphs of things contain information about business objects (such as products, suppliers, employees, locations, research topics, …), their different names, and relations to each other. Information about entities can be found in structured (relational databases), semi-structured (XML), and unstructured (text) data objects. Nevertheless, people are not interested in containers but in entities themselves, so they need to be extracted and organized in a reasonable way.

Machines and algorithms make use of semantic graphs to retrieve not only simply the objects themselves but also the relations that can be found between the business objects, even if they are not explicitly stated. As a result, ‘knowledge lenses’ are delivered that help users to better understand the underlying meaning of business objects when put into a specific context.

Personalization of information
The ability to take a view on entities or business objects in different ways when put into various contexts is key for many knowledge workers. For example, drugs have regulatory aspects, a therapeutical character, and some other meaning to product managers or sales people. One can benefit quickly when only confronted with those aspects of an entity that are really relevant in a given situation. This rather personalized information processing has heavy demand for a semantic layer on top of the data layer, especially when information is stored in various forms and when scattered around different repositories.

Understanding and modelling the meaning of content assets and of interest profiles of users are based on the very same methodology. In both cases, semantic graphs are used, and also the linking of various types of business objects works the same way.

Recommender engines based on semantic graphs can link similar contents or documents that are related to each other in a highly precise manner. The same algorithms help to link users to content assets or products. This approach is the basis for ‘push-services’ that try to ‘understand’ users’ needs in a highly sophisticated way.

‘Not only MetaData’ Architecture
Together with the data and content layer and its corresponding metadata, this approach unfolds into a four-layered information architecture as depicted here.

Following the NoSQL paradigm, which is about ‘Not only SQL’, one could call this content architecture ‘Not only Metadata’, thus ‘NoMeDa’ architecture. It stresses the importance of the semantic layer on top of all kinds of data. Semantics is no longer buried in data silos but rather linked to the metadata of the underlying data assets. Therefore it helps to ‘harmonize’ different metadata schemes and various vocabularies. It makes the semantics of metadata, and of data in general, explicitly available. While metadata most often is stored per data source, and therefore not linked to each other, the semantic layer is no longer embedded in databases. It reflects the common sense of a certain domain and through its graph-like structure it can serve directly to fulfill several complex tasks in information management:

  • Knowledge discovery, search and analytics
  • Information and data linking
  • Recommendation and personalization of information
  • Data visualization

Graph-based Data Modelling
Graph-based semantic models resemble the way human beings tend to construct their own models of the world. Any person, not only a subject matter expert, organizes information by at least the following six principles:

  1. Draw a distinction between all kinds of things: ‘This thing is not that thing’
  2. Give things names: ‘This thing is my dog Goofy’ (some might call it Dippy Dawg, but it’s still the same thing)
  3. Categorize things: ‘This thing is a dog but not a cat’
  4. Create general facts and relate categories to each other: ‘Dogs don’t like cats’
  5. Create specific facts and relate things to each other: ‘Goofy is a friend of Donald’, ‘Donald is the uncle of Huey, Dewey, and Louie’, etc.
  6. Use various languages for this; e.g. the above mentioned fact in German is ‘Donald ist der Onkel von Tick, Trick und Track’ (remember: the thing called ‘Huey’ is the same thing as the thing called ‘Tick’ - it’s just that the name or label for this thing that is different in different languages).

These fundamental principles for the organization of information are well reflected by semantic knowledge graphs. The same information could be stored as XML, or in a relational database, but it’s more efficient to use graph databases instead for the following reasons:

  • The way people think fits well with information that is modelled and stored when using graphs; little or no translation is necessary.
  • Graphs serve as a universal meta-language to link information from structured and unstructured data.
  • Graphs open up doors to a better aligned data management throughout larger organizations.
  • Graph-based semantic models can also be understood by subject matter experts, who are actually the experts in a certain domain.
  • The search capabilities provided by graphs let you find out unknown linkages or even non-obvious patterns to give you new insights into your data.
  • For semantic graph databases, there is a standardized query language called SPARQL that allows you to explore data.
  • In contrast to traditional ways to query databases where knowledge about the database schema/content is necessary, SPARQL allows you to ask “tell me what is there”.

Standards-based Semantics
Making the semantics of data and metadata explicit is even more powerful when based on standards. A framework for this purpose has evolved over the past 15 years at W3C, the World Wide Web Consortium. Initially designed to be used on the World Wide Web, many enterprises have been adopting this stack of standards for Enterprise Information Management. They now benefit from being able to integrate and link data from internal and external sources with relatively low costs.

At the base of all those standards, the Resource Description Framework (RDF) serves as a ‘lingua franca’ to express all kinds of facts that can involve virtually any kind of category or entity, and also all kinds of relations. RDF can be used to describe the semantics of unstructured text, XML documents, or even relational databases. The Simple Knowledge Organization System (SKOS) is based on RDF. SKOS is widely used to describe taxonomies and other types of controlled vocabularies. SPARQL can be used to traverse and make queries over graphs based on RDF or standard schemes like SKOS.

With SPARQL, far more complex queries can be executed than with most other database query languages. For instance, hierarchies can be traversed and aggregated recursively: a geographical taxonomy can then be used to find all documents containing places in a certain region although the region itself is not mentioned explicitly.
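As an illustration of such a recursive query, the sketch below traverses a SKOS hierarchy with a SPARQL property path and sends the query from R with the httr package; the endpoint URL, the region label and the use of dct:subject as the annotation property are all invented for the example.

library(httr)

# Hypothetical endpoint and annotation property; adjust to your own graph
endpoint = "http://example.org/sparql"

query = '
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT DISTINCT ?doc WHERE {
  ?region skos:prefLabel "Western Europe"@en .
  ?place  skos:broader+ ?region .      # every place anywhere below the region
  ?doc    dct:subject   ?place .       # documents tagged with such a place
}'

# SPARQL protocol: form-encoded POST, asking for JSON results
res = POST(endpoint,
           body = list(query = query),
           encode = "form",
           add_headers(Accept = "application/sparql-results+json"))
content(res)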

Standards-based semantics also helps to make use of already existing knowledge graphs. Many government organisations have made available high-quality taxonomies and semantic graphs by using semantic web standards. These can be picked up easily to extend them with own data and specific knowledge.

Semantic Knowledge Graphs will grow with your needs!
Standards-based semantics provides yet another advantage: it is becoming increasingly simple to hire skilled people who have worked with standards like RDF, SKOS or SPARQL before. Even so, experienced knowledge engineers and data scientists are a comparatively rare species. It is therefore crucial to grow graphs and modelling skills over time. Starting with SKOS and extending an enterprise knowledge graph over time, by introducing more schemes and by mapping to other vocabularies and datasets, is a well-established agile procedure model.

A graph-based semantic layer in enterprises can be expanded step-by-step, just like any other network. Analogous to a street network, start first with the main roads, introduce more and more connecting roads, classify streets, places, and intersections by a more and more distinguished classification system. It all comes down to an evolving semantic graph that will serve more and more as a map of your data, content and knowledge assets.

Semantic Knowledge Graphs and your Content Architecture
It’s a matter of fact that semantics serves as a kind of glue between unstructured and structured information and as a foundation layer for data integration efforts. But even for enterprises dealing mainly with documents and text-based assets, semantic knowledge graphs will do a great job.

Semantic graphs extend the functionality of a traditional search index. They don’t simply annotate documents and store occurrences of terms and phrases, they introduce concept-based indexing in contrast to term based approaches. Remember: semantics helps to identify the things behind the strings. The same applies to concept-based search over content repositories: documents get linked to the semantic layer, and therefore the knowledge graph can be used not only for typical retrieval but to classify, aggregate, filter, and traverse the content of documents.

PoolParty combines Machine Learning with Human Intelligence
Semantic knowledge graphs have the potential to innovate data and information management in any organisation. Besides questions around integrability, it is crucial to develop strategies to create and sustain the semantic layer efficiently.

Looking at the broad spectrum of semantic technologies that can be used for this endeavour, they range from manual to fully automated approaches. The promise to derive high-quality semantic graphs from documents fully automatically has not been fulfilled to date. On the other side, handcrafted semantics is error-prone, incomplete, and too expensive. The best solution often lies in a combination of different approaches. PoolParty combines Machine Learning with Human Intelligence: extensive corpus analysis and corpus learning support taxonomists, knowledge engineers and subject matter experts with the maintenance and quality assurance of semantic knowledge graphs and controlled vocabularies. As a result, enterprise knowledge graphs are more complete, up to date, and consistently used.

“An Enterprise without a Semantic Layer is like a Country without a Map.”

 

Read more…

Originally posted on Data Science Central

Original post published to DataScience+

In this post I will show how to collect data from a webpage and analyze or visualize it in R. For this task I will use the rvest package and will get the data from Wikipedia. I got the idea to write this post from Fisseha Berhane.
I will get the prevalence of obesity in the United States from a Wikipedia page and then plot it on a map. Let's begin by loading the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)


Download the data from Wikipedia.

## LOAD THE DATA ####
obesity = read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
obesity = obesity %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = T)


The first line of code reads the page from Wikipedia, and the lines that follow extract the table we are interested in and transform it into a data frame in R.
Here is the head of our data.

head(obesity)
State and District of Columbia Obese adults Overweight (incl. obese) adults
1 Alabama 30.1% 65.4%
2 Alaska 27.3% 64.5%
3 Arizona 23.3% 59.5%
4 Arkansas 28.1% 64.7%
5 California 23.1% 59.4%
6 Colorado 21.0% 55.0%
Obese children and adolescents Obesity rank
1 16.7% 3
2 11.1% 14
3 12.2% 40
4 16.4% 9
5 13.2% 41
6 9.9% 51


The data frame looks good; now we need to clean it to make it ready for plotting.

## CLEAN THE DATA ####
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
$ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
$ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...

# remove the % and make the data numeric
for(i in 2:4){
  obesity[,i] = gsub("%", "", obesity[,i])
  obesity[,i] = as.numeric(obesity[,i])
}
# check data again
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
$ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
$ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...


Fix the names of variables by removing the spaces.

names(obesity)
[1] "State and District of Columbia" "Obese adults"
[3] "Overweight (incl. obese) adults" "Obese children and adolescents"
[5] "Obesity rank"

names(obesity) = make.names(names(obesity))
names(obesity)
[1] "State.and.District.of.Columbia" "Obese.adults"
[3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
[5] "Obesity.rank"


Now, it's time to load the map data.

# load the US state map data (map_data() requires the 'maps' package to be installed)
states = map_data("state")
str(states)
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...


We will merge the two datasets (obesity and states) by region, so we first need to create a new variable (region) in the obesity dataset.

# create a new variable name for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)


Merge the datasets.

states = merge(states, obesity, by="region", all.x=T)
str(states)
'data.frame': 15537 obs. of 11 variables:
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ subregion : chr NA NA NA NA ...
$ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
$ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
$ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
$ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...

Plot the data


Finally, we will plot the prevalence of obesity in adults.

## MAKE THE PLOT ####
# adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value = "black", breaks = pretty_breaks(n = 5)) +
  labs(title = "Prevalence of Obesity in Adults") +
  coord_map()


Here is the plot for adults:
Similarly, we can plot the prevalence of obesity in children.

# children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49", guide = "colorbar", na.value = "black", breaks = pretty_breaks(n = 5)) +
  labs(title = "Prevalence of Obesity in Children") +
  coord_map()


Here is the plot for children:
If you would like to show the state names on the map, use the code below to create a new dataset.

statenames = states %>%
  group_by(region) %>%
  summarise(
    long = mean(range(long)),
    lat = mean(range(lat)),
    group = mean(group),
    Obese.adults = mean(Obese.adults),
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )


Then add this line of code to the ggplot code above:

geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)


That's all. I hope you learned something useful today.

Read more…

Guest blog post by Klodian

Original post is published at DataScience+

Recently, I became interested in grabbing data from webpages, such as Wikipedia, and visualizing it with R. As I did in my previous post, I use the rvest package to get the data from the webpage and the ggplot2 package to visualize it.
In this post, I will map the life expectancy of White and African-American people in the US.
Load the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
le = le %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = T)

Now I have to clean the data. Below I explain the role of each line of code.

## CLEAN THE DATA ####
# select only columns with data
le = le[c(1:8)]
# get the names from 3rd row and add to columns
names(le) = le[3,]
# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4,5)] = c("le_black", "le_white")
# convert the variables to numeric
le = le %>%
  mutate(
    le_black = as.numeric(le_black),
    le_white = as.numeric(le_white))

Since there are some differences in life expectancy between White and African-American populations, I will calculate the difference and map it.

le = le %>% mutate(le_diff = (le_white - le_black))

I will load the map data and merge the datasets together.
## LOAD THE MAP DATA ####
states = map_data("state")
# create a new variable name for state
le$region = tolower(le$State)
# merge the datasets
states = merge(states, le, by="region", all.x=T)

Now it's time to make the plot. First, I will plot the life expectancy of African Americans in the US. For a few states we don't have data, so I will color them grey.
## MAKE THE PLOT ####
# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value = "#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()

Here is the plot:

The code below is for White Americans in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value = "Gray", breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in White") +
  coord_map()

Here is the plot:

Finally, I will map the differences between white and African American people in US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value = "#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title = "Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
On my previous post I got a comment asking to add a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. What you have to do is install the plotly package, create an object from the ggplot call, and then use the function ggplotly(map_plot) to plot it.

library(plotly)
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49", guide = "colorbar", na.value = "#eeeeee", breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()
ggplotly(map_plot)

Here is the plot:
That's all! Leave a comment below if you have any questions.

Original post: Map the Life Expectancy in United States with data from Wikipedia

Read more…

13 Great Data Science Infographics

Originally posted on Data Science Central

Most of these infographics are tutorials covering various topics in big data, machine learning, visualization, data science, Hadoop, R or Python, typically intended for beginners. Some are cheat sheets and can be nice summaries for professionals with years of experience. Some, popular a while back (you will find one example here), were designed as periodic tables.

For Geeks 

For Business People

Infographics Repositories

Read more…

Twitter Analytics using Tweepsmap

Guest blog post by Salman Khan

This morning I saw #tweepsmap on my Twitter feed and decided to check it out. Tweepsmap is a neat tool that can analyze any Twitter account from a social network perspective. It can create interactive maps showing where the followers of a Twitter account reside, segment followers, and even show who unfollowed you!

Here is my Followers map generated by country.

You can create the followers map based on city and state as well.

Tweepsmap also provides demographic information such as languages, occupation and gender, but it relies on the Twitter user having entered this information in their Twitter profile.

There is also a hashtag and keyword analyzer that reports on the most prolific tweeters, locations of tweets, tweets vs. retweets, and so on. I used their free report, which is limited to a maximum of 100 tweets, to analyze the trending hashtag #BeautyAndTheBeast. For some reason, the #BeautyAndTheBeast hashtag is really popular in Brazil: out of the 100 tweets, 26 were from Brazil and 20 from the USA. You can see that 3 out of 5 of the top influencers with the most followers are tweeting in Portuguese. Other visualizations included the tweets vs. retweets numbers and the distribution reach of the tweeters. I was even able to make the report public, so you can check it out here. Remember it only analyzes 100 tweets, so don't draw any conclusions from it!


If you are doing research on social media, or are a business that wants to learn more about competitors and customers, Tweepsmap helps you analyze specific Twitter accounts as well! Of course, we all know there is no such thing as a free lunch, so this is a paid feature!

From what I saw by tinkering with the pricing calculator on their page, the analysis of a Twitter account with more than 2.5M followers costs a flat fee of $5K. I tried a few Twitter accounts to see how much each would cost based on the number of followers and found that the cost per follower was $0.002. So if you wanted to get Twitter data on Hans Rosling, it would cost you $642, as he has 320,956 followers (320,956 × $0.002 ≈ $642).
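If you want to sanity-check that math, here is a small R sketch of the pricing logic as I understand it; the $0.002-per-follower rate and the $5K flat fee above 2.5M followers are estimates from playing with their calculator, not an official rate card, and the function name is just illustrative.

# Rough estimate of a Tweepsmap account-analysis cost, based on the observed
# $0.002-per-follower rate and the $5K flat fee above 2.5M followers.
estimate_cost <- function(followers, rate = 0.002, cap_followers = 2.5e6, flat_fee = 5000) {
  if (followers > cap_followers) flat_fee else round(followers * rate)
}

estimate_cost(320956)   # ~642, matching the Hans Rosling example above
estimate_cost(3e6)      # 5000, the flat fee for very large accounts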


Overall, this looks like a neat tool to get started with when analyzing Twitter data and using this information to maximize the returns on your tweets. I have only mentioned a few of their tools above; they have other features like Best Time to Tweet, which will analyze your audience, Twitter history, time zones, and so on to predict when you will get the most out of your tweet. Check out their website for more info here.

Read more…

Originally posted on Data Science Central

Written by sought-after speaker, designer, and researcher Stephanie D. H. Evergreen, Effective Data Visualization shows readers how to create Excel charts and graphs that best communicate data findings. This comprehensive how-to guide functions as a set of blueprints, supported by research and the author's extensive experience with clients in industries all over the world, for conveying data in an impactful way. Delivered in Evergreen's humorous and approachable style, the book covers the spectrum of graph types available beyond the default options, how to determine which one most appropriately fits specific data stories, and easy steps for making the chosen graph in Excel.

The book is available here.


Read more…

Guest blog post by Max Wegner

What’s the first thing you think of when you hear the phrase “artificial intelligence”? Perhaps it’s the HAL 9000 from 2001: A Space Odyssey, or maybe it’s chess Grandmaster Garry Kasparov losing to IBM’s Deep Blue supercomputer. While those are indeed examples of artificial intelligence, examples of AI in the real world of today are a bit more mundane and a whole lot less sinister.

In fact, many of us use AI, in one form or another, in our everyday lives. The personal assistant on your smartphone that helps you locate information, the facial recognition software on Facebook photos, and even the gesture control on your favourite video game are all examples of practical AI applications. Rather than being a part of a dystopian world view in which the machines take over, current AI makes our lives a whole lot more convenient by carrying out simple tasks for us.

What’s more, there’s a lot of money flowing into a lot of companies working on AI developments. This means that in the near future, we could see even more practical uses for AI, from smart robots to smart drones and more.

To give you a better understanding of the current state of AI, our friends at appcessories.co.uk have put together this helpful Artificial Intelligence infographic. It will give you the full rundown, from categories to geography to finances. Check it out, and you’ll see why AI is so essential to our everyday lives, and why the future of AI looks so bright.

Read more…

Originally posted on Data Science Central

This article was posted by Bethany Cartwright. Bethany is the blog team's Data Visualization Intern. She spends most of her time creating infographics and other visuals for blog posts.

Whether you're writing a blog post, putting together a presentation, or working on a full-length report, using data in your content marketing strategy is a must. Data helps enhance your arguments by making your writing more compelling. It gives your readers context. And it helps provide support for your claims.

That being said, if you're not a data scientist yourself, it can be difficult to know where to look for data and how to best present that data once you've got it. To help, below you'll find the tools and resources you need to source credible data and create some stunning visualizations.

Resources for Uncovering Credible Data

When looking for data, it’s important to find numbers that not only look good, but are also credible and reliable.

The following resources will point you in the direction of some credible sources to get you started, but don’t forget to fact-check everything you come across. Always ask yourself: Is this data original, reliable, current, and comprehensive?

Tools for Creating Data Visualizations

Now that you know where to find credible data, it’s time to start thinking about how you’re going to display that data in a way that works for your audience.

At its core, data visualization is the process of turning basic facts and figures into a digestible image, whether it's a chart, graph, timeline, map, infographic, or other type of visual.

While understanding the theory behind data visualization is one thing, you also need the tools and resources to make digital data visualization possible. Below we’ve collected 10 powerful tools for you to browse, bookmark, or download to make designing data visuals even easier for your business.

To check all this information, click here. For more articles about data visualization, click here.


Read more…

Guest blog post by Mike Waldron

Originally posted on Data Science Central

This blog was originally published on the AYLIEN Text Analysis blog

We wanted to gather and analyze news content in order to look for similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.

Note: For a more technical, in-depth and interactive representation of this project, check out the Jupyter notebook we created. This includes sample code and more in depth descriptions of our approach.

The Approach

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and tSNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists

Creating Parse Trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence according to some pre-defined grammar. For example, with a simple sentence like "The cat sat on the mat", a parse tree might look like this:



Thankfully, parsing our extracted headlines isn't too difficult. We used the Pattern library for Python to parse the headlines and generate our parse trees.

Data

In total we gathered about 700 article headlines for each of our journalists using the AYLIEN News API, which we then analyzed using Python. If you'd like to give it a go yourself, you can grab the pickled data files directly from the GitHub repository (link), or use the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them together with some basic information about the headline in the same Python object.

Then using a sequence similarity metric, we compared all of these headlines two by two, to build a similarity matrix.
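The linked notebook has the exact implementation; as a minimal illustration of what a "sequence similarity over chunk types" can look like, here is a hedged R sketch. The metric (longest common subsequence, normalized by average length) and the toy tag sequences are illustrative stand-ins, not the exact metric from the notebook.

# Similarity between two chunk-tag sequences: 2 * LCS / (len1 + len2)
lcs_length <- function(a, b) {
  m <- length(a); n <- length(b)
  dp <- matrix(0, m + 1, n + 1)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      dp[i + 1, j + 1] <- if (a[i] == b[j]) dp[i, j] + 1 else max(dp[i, j + 1], dp[i + 1, j])
    }
  }
  dp[m + 1, n + 1]
}

seq_similarity <- function(a, b) 2 * lcs_length(a, b) / (length(a) + length(b))

# Toy chunk-tag sequences standing in for parsed headlines
headlines <- list(c("NP", "VP"), c("NP", "VP", "PP", "NP"), c("NP", "VP", "NP"))
sim <- outer(seq_along(headlines), seq_along(headlines),
             Vectorize(function(i, j) seq_similarity(headlines[[i]], headlines[[j]])))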

Visualizations

To visualize the headline similarities for Akin, we generated a 2D scatter plot, with the hope that similarly structured headlines would end up grouped close together on the graph.

To achieve this, we first reduced the dimensionality of our similarity matrix using tSNE and applied K-Means clustering to find groups of similar headlines. We also used a nice viz library, which we've outlined below (a rough R analogue of the whole pipeline is sketched after the list):

  • tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Bokeh to plot the actual chart
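The original pipeline is Python (with Bokeh for the interactive chart). For readers following along in R, a rough analogue might look like the sketch below; it assumes `sim` is the full 700 x 700 headline similarity matrix (the toy example above only builds a 3 x 3 one) and uses ggplot2 in place of Bokeh.

library(Rtsne)
library(ggplot2)

# Turn similarity into distance, embed in 2D with tSNE, then cluster into 5 groups
dist_mat <- 1 - sim
set.seed(42)
emb <- Rtsne(dist_mat, is_distance = TRUE, perplexity = 30)$Y   # n x 2 embedding
clusters <- kmeans(emb, centers = 5)$cluster

plot_df <- data.frame(x = emb[, 1], y = emb[, 2], cluster = factor(clusters))
ggplot(plot_df, aes(x, y, colour = cluster)) +
  geom_point()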



The chart above shows a number of dense groups of headlines, as well as some sparse ones. Each dot on the graph represents a headline, as you can see when you hover over one in the interactive version. Similar titles are grouped together quite cleanly. Some of the groups that stand out are:

  • The circular group left of center typically consists of short, snappy stock-update headlines such as “Viacom is crashing”.
  • The large circular group on the top right consists mostly of announcement-style headlines with a “Here come the…” format.
  • The headlines in the small green circular group towards the bottom left are similar and use the same phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.

Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.


Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both. The yellow dots represent our celebrity-focused author and the blue dots our finance reporter.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also includes some headlines from the second author, such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.

Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them quite interesting, there were some areas that we thought could be improved. Some of the weaknesses of our approach, and ways to improve them, are:

 - Using entire parse trees instead of just the chunk types

 - Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistic-aware one too)

 - Better pre-processing to identify and normalize Named Entities, etc.

Next up..

In our next post, we’re going to study the correlations between various headline structures and some external metrics like number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy Celebrity style headlines would probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or want to see the sample code we used head over to our Jupyter notebook.

Read more…

Guest blog post by Jeff Pettiross

For almost as long as we have been writing, we’ve been putting meaning into maps, charts, and graphs. Some 1,300 years ago, Chinese astronomers recorded the position of the stars and the shapes of the constellations. The Dunhuang star maps are the oldest preserved atlas of the sky:

More than 500 years ago, the residents of the Marshall Islands learned to navigate the surrounding waters by canoe in the daytime—without the aid of stars. These master rowers learned to recognize the feel of the currents reflecting off the nearby islands. They visualized their insights on maps made of sticks, rocks, and shells.

In the 1800s, Florence Nightingale used charts to explain to government officials how treatable diseases were killing more soldiers in the Crimean War than battle wounds. She knew that pictures would tell a more powerful story than numbers alone:

Why Visualized Data Is So Powerful

Since long before spreadsheets and graphing software, we have communicated data through pictures. But we’ve only begun, in the last half-century, to understand why visualizations are such effective tools for seeing and understanding data.

It starts with the part of your brain called the visual cortex. Located near the bony lump at the back of your skull, it processes input from your eyes. Thanks to the visual cortex, our sense of sight provides information much faster than the other senses. We actually begin to process what we see before we think about it.

This is sound from an evolutionary perspective. The early human who had to stop and think, “Hmm, is that a jaguar sprinting toward me?” probably didn’t survive to pass on their genes. There is a biological imperative for our sense of sight to override cognition—in this case, for us to pay sharp attention to movement in our peripheral vision.

Today, our sight is more likely to save us on a busy street than on the savannah. Moving cars and blinking lights activate the same peripheral attention, helping us navigate a complicated visual environment. We see other cues on the street, too. Bright orange traffic cones mark hazards. Signs call out places, directions, and warnings. Vertical stripes on the street indicate lanes while horizontal lines indicate stop lines.

We have designed a rich, visual system that drivers can comprehend quickly, thanks to perceptual psychology. Our visual cortex is attuned to color hues (like safety orange), position (signs placed above road), and line orientation (lanes versus stop lines). Research has identified other visual features. Size, clustering, and shape also help us perceive our environment almost immediately.

What This Means for Us Today

Fortunately, our offices and homes tend to be safer than the savannah or the highway. Regardless, our lightning-quick sense of vision jumps into action even when we read email, tweets, or websites. And that right there is why data visualization communicates so powerfully and immediately: It takes advantage of these visual features, too.

A line graph immediately reveals upward or downward changes, thanks to the orientation of each segment. The axes of the graph use position to communicate values in relationship to each other. If there are multiple, colored lines, the color hue lets us rapidly tell the lines apart, no matter how many times they cross. Bar charts, maps with symbols, area graphs—these all use the visual superhighway in our brains to communicate meaning.

The early pioneers of data visualization were led by their intuition to use visual features like position, clustering, and hue. The longevity of those works is a testament to their power.

We now have software to help us visualize data and to turn tables of facts and figures into meaningful insights. That means anyone, even non-experts, can explore data in a way that wasn’t possible even 20 years ago. We can, all of us, analyze the world’s growing volume of data, spot trends and outliers, and make data-driven decisions.

Today, we don’t just have charts and graphs; we have the science behind them. We have started to unlock the principles of perception and cognition so we can apply them in new ways and in various combinations. A scatter plot can leverage position, hue, and size to visualize data. Its data points can interactively filter related charts, allowing the user to shift perspectives in their analysis by simply clicking on a point. Animating transitions as users pivot from one idea to the next brings previously hidden differences to the foreground. We’re building on the intuition of the pioneers and the conclusions of science to make analysis faster, easier, and more intuitive.

When humanity unlocked the science behind fire and magnets, we learned to harness chemistry and physics in new ways. And we revolutionized the world with steam engines and electrical generators.

Humanity is now at the dawn of a new revolution, and intuitive tools are putting the beautiful science of data visualization into the hands of millions of users.

I’m excited to see where you take all of us next.

Note: This post first appeared in VentureBeat.

Read more…

Investigating Airport Connectedness

Guest blog post by SupStat

Contributed by neuroscientist Sricharan Maddineni, who has a huge passion and talent for data science. He took the NYC Data Science Academy 12-week boot camp program between January 11th and April 1st, 2016. This post is based on his second project, posted on February 16th (due in the 4th week of the program). He acquired publicly available transportation data, consulted social media for help along the way, and visualized the economic and business insights he found.

Why Are Airports Important?

(Photo by theprospect.net)

Aviation infrastructure has been a bedrock of the United States economy and culture for many decades, and it was the first instrument through which we connected with the world. Before the invention of flight, humans were inexorably confined by the immenseness of Earth's oceans.

All the disdain and unpleasantries we endure on flights are quickly forgotten once we safely land at our destinations and realize we have just been transported to a new place on our vast planet. Every time I have flown and landed in a new country or city, I am overwhelmed with feelings of how beautiful our world is and how much I wish I could visit every corner of our planet. My love of aviation has led me to investigate the connectedness of United States airports and the passenger-disparity between the developed and developing countries.

The App

The interactive map can be used as a tool to investigate the connectedness of the US airports. Users can choose from a list of airports including LAX, JFK, IAD and more to visualize the connections out of that airport. The 'Airport Connections' table shows us the combinations of connections by Airline Carrier. For example, we can see that American Airlines (AA) had 8058 flights out of LAX to JFK (2009 dataset). The 'Carriers' table shows us the total flights out of LAX by American Airlines (76,670).

If we select Hartsfield-Jackson Atlanta International, we see that it is the most connected airport in the United States. *Please note that I am not plotting all possible connections, just major airport connections and only within the United States (the map would be filled solid if I plotted all connections!). The size of the airport bubble is calculated by the number of connections. Therefore, all large bubbles are international airports, and smaller bubbles are regional/domestic airports.

I also plotted Voronoi tessellations between the airports using one nearest neighbor to show the area differences between airports on the East Coast, West Coast, and in the Midwest. The largest polygons are found in the Midwest because airports are far apart in all directions. These airports are generally more connected as well, since they connect the east and west coasts (see Denver International or Salt Lake City International). Clicking on a Voronoi polygon brings up the nearest airport within that area.

Why is it important for countries to improve their airport infrastructure?

Looking at the Motion/Bubble Chart, we observe that developing countries travel horizontally whereas developed countries travel vertically. This indicates that developed countries' populations have remained steady, but they have seen a rise in passenger travelers. On the flip side, developing countries have seen their populations boom, but the number of air travelers has remained stagnant.

Most importantly, countries moving upward show noticeable gains in GDP whereas countries moving horizontally show minimal gains over the last four decades (GDP is represented by the size of the bubble). We can also notice that airline passenger counts plunge during recessions for first world countries but remain comparatively steady for developing countries (1980, 2000, 2009). We can interpret this to mean that developing countries are not as connected to the rest of the world since their economies are unaffected by global economic crises.

Passenger Counts during weekends and Holidays

The calendar heatmap shows us the daily flight count in the United States. We can see that airlines operate significantly fewer flights on Saturdays and national holidays such as July 4th and Thanksgiving. The days leading up to and after national holidays show an increase in flights, as expected. Looking carefully, you can also notice there are fewer flights on Tuesdays and Wednesdays, and more flights during the summer season.

If you select a day on the calendar, a table shows the top 20 airline carriers by flight count on that day. Southwest, American Airlines, SkyWest, and Delta seem to operate the most flights in the United States.


The Data

1. Interactive Map

I utilized comprehensive datasets provided by the United States Department of Transportation and Open Data by Socrata that allowed me to map airport connections in the United States. The first dataset included airport locations (city/state) and their latitude and longitude coordinates, and the second dataset included the airport connections (LAX - JFK, LAX - SFO, ...). First, I used these datasets to calculate the size of each airport based on how many connections it had.

https://github.com/nycdatasci/bootcamp004_project/tree/master/Project2-Shiny/Sri_New
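The full code is in the repository above; conceptually, the bubble-sizing step boils down to counting connections per origin airport. Here is a hedged sketch with placeholder data (the data frame and column names are illustrative, not the actual dataset):

library(dplyr)

# Toy stand-in for the connections dataset: one row per route (origin, dest)
connections <- data.frame(origin = c("LAX", "LAX", "ATL", "ATL", "ATL"),
                          dest   = c("JFK", "SFO", "LAX", "JFK", "ORD"))

airport_size <- connections %>%
  group_by(origin) %>%
  summarise(n_connections = n_distinct(dest)) %>%
  arrange(desc(n_connections))   # the most connected airports come out on top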

2. Motion Chart

The second analysis was done using the airline passenger, population, and GDP numbers for the world's countries over the last 45 years. Most of the work here was in transforming the three datasets provided by the World Bank from wide to long. See the code below.

https://github.com/nycdatasci/bootcamp004_project/tree/master/Project2-Shiny/Sri_New
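Again, the actual transformation is in the repository; the wide-to-long step itself amounts to something like the following sketch (a toy table with placeholder values, and tidyr's pivot_longer standing in for whatever reshaping function was used at the time):

library(dplyr)
library(tidyr)

# Toy stand-in for a World Bank-style wide table (one column per year);
# the numbers are placeholders, not real figures.
passengers_wide <- data.frame(country = c("Country A", "Country B"),
                              `1970` = c(100, 10),
                              `2014` = c(800, 90),
                              check.names = FALSE)

passengers_long <- passengers_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "passengers") %>%
  mutate(year = as.integer(year))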

3. Calendar Chart

Lastly, I used the Transtats database to obtain the daily flight counts by airline carrier for the years 2004-2007. Some transformation was done to create two separate data frames: flight counts per day and flight counts per carrier. While trying to calculate flight counts by day, I tried this code:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, month) %>% summarise(sum = n()) 

I knew there was an error by looking at the resulting heatmap, but I didn't realize it was showing me a cumulative sum by month rather than the daily flight count, so I hit Twitter to see if I could get help diagnosing my problem. I tweeted Jeff Weis, who appeared as the aviation analyst on CNN during the Malaysian Airlines MH370 disappearance, and he caught my mistake! After he had pushed me in the right direction, I corrected my code to:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, date) %>% summarise(count = n())
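In other words, the two data frames mentioned above reduce to two small summaries of the same f2007 table. This is a sketch; it assumes a date column has already been derived from the year/month/day fields in the raw Transtats export.

library(dplyr)

# Flight counts per day (feeds the calendar heatmap)
flights_per_day <- f2007 %>%
  group_by(date) %>%
  summarise(count = n())

# Flight counts per carrier per day (feeds the top-20 carriers table)
flights_per_carrier <- f2007 %>%
  group_by(date, UniqueCarrier) %>%
  summarise(count = n())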

The Code

Creating Voronoi Polygons
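The actual Voronoi code is in the repository linked above; the core idea, sketched here with the deldir package and a few placeholder airport coordinates, is to tessellate the plane so each polygon is the region closer to its airport than to any other.

library(deldir)

# Placeholder airport coordinates (longitude, latitude)
airports <- data.frame(name = c("ATL", "DEN", "LAX"),
                       lon  = c(-84.43, -104.67, -118.41),
                       lat  = c(33.64, 39.86, 33.94))

# Voronoi tessellation: one tile per airport (its one-nearest-neighbour area)
vor   <- deldir(airports$lon, airports$lat)
tiles <- tile.list(vor)            # list of polygons, one per airport
str(tiles[[1]][c("x", "y")])       # vertex coordinates of the first polygon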

Connection Lines

The second step was creating the line connections between the airports. To do this, I used the polylines function in Leaflet (addPolylines) to add connecting lines between airports, filtered by user input. input$Input1 catches the user-selected airport and subsets the dataset to all origin airports that equal the selected airport. The gcIntermediate function makes those lines curved.
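A hedged sketch of that step is below; the actual code lives in the repository, and the flights data frame and its coordinate column names here are illustrative. It is meant to run inside the Shiny server, where input$Input1 holds the selected airport.

library(leaflet)
library(geosphere)

# Illustrative columns: origin, origin_lon, origin_lat, dest_lon, dest_lat
routes <- subset(flights, origin == input$Input1)   # airport picked in the UI

# Great-circle arcs between each origin/destination pair (the curved lines)
arcs <- gcIntermediate(as.matrix(routes[, c("origin_lon", "origin_lat")]),
                       as.matrix(routes[, c("dest_lon", "dest_lat")]),
                       n = 50, addStartEnd = TRUE, sp = TRUE)

leaflet() %>%
  addTiles() %>%
  addPolylines(data = arcs, weight = 1, opacity = 0.6)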

Calendar json capture

The calendar chart required two parameters: whichdatevar, which reads the date column, and numvar, which plots the value for each day on the calendar. Then I utilized the gvis.listener.jscode method to capture the user-selected date and filter the dataset for the table.
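A minimal sketch of that configuration with googleVis is below. The datevar/numvar parameter names follow the gvisCalendar help page, flights_per_day is assumed to have a Date column date and a numeric column count, and the listener string is only illustrative.

library(googleVis)

jscode <- "Shiny.onInputChange('selected_row', chart.getSelection()[0].row);"

cal <- gvisCalendar(flights_per_day,
                    datevar = "date",
                    numvar  = "count",
                    options = list(gvis.listener.jscode = jscode))
plot(cal)   # inside the Shiny app this would be rendered with renderGvis()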

To experience Sricharan Maddineni's interactive Shiny App, see the project repository linked above.

Read more…

Guest blog post by Irina Papuc

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic.

What is Machine Learning?

So what exactly is “machine learning” anyway? ML is actually a lot of things. The field is quite vast and is expanding rapidly, being continually partitioned and sub-partitioned ad nauseam into different sub-specialties and types of machine learning.

There are some basic common threads, however, and the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959: “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”

And more recently, in 1997, Tom Mitchell gave a “well-posed” definition that has proven more useful to engineering types: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” -- Tom Mitchell, Carnegie Mellon University

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P).

The highly complex nature of many real-world problems, though, often means that inventing specialized algorithms that will solve them perfectly every time is impractical, if not impossible. Examples of machine learning problems include “Is this cancer?”, “What is the market value of this house?”, “Which of these people are good friends with each other?”, “Will this rocket engine explode on takeoff?”, “Will this person like this movie?”, “Who is this?”, “What did you say?”, and “How do you fly this thing?”. All of these problems are excellent targets for an ML project, and in fact ML has been applied to each of them with great success.

ML solves problems that cannot be solved by numerical means alone.

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

  • Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.

  • Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships therein.

We will primarily focus on supervised learning here, but the end of the article includes a brief discussion of unsupervised learning with some links for those who are interested in pursuing the topic further.

Supervised Machine Learning

In the majority of supervised learning applications, the ultimate goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”). “Learning” consists of using sophisticated mathematical algorithms to optimize this function so that, given input data x about a certain domain (say, square footage of a house), it will accurately predict some interesting value h(x) (say, market price for said house).

In practice, x almost always represents multiple data points. So, for example, a housing price predictor might take not only square-footage (x1) but also number of bedrooms (x2), number of bathrooms (x3), number of floors (x4), year built (x5), zip code (x6), and so forth. Determining which inputs to use is an important part of ML design. However, for the sake of explanation, it is easiest to assume a single input value is used.

So let’s say our simple predictor has this form:

h(x) = θ0 + θ1·x

where θ0 and θ1 are constants. Our goal is to find the perfect values of θ0 and θ1 to make our predictor work as well as possible.

Optimizing the predictor h(x) is done using training examples. For each training example, we have an input value x_train, for which a corresponding output, y, is known in advance. For each example, we find the difference between the known, correct value y and our predicted value h(x_train). With enough training examples, these differences give us a useful way to measure the “wrongness” of h(x). We can then tweak h(x) by tweaking the values of θ0 and θ1 to make it “less wrong”. This process is repeated over and over until the system has converged on the best values for θ0 and θ1. In this way, the predictor becomes trained, and is ready to do some real-world predicting.

A Simple Machine Learning Example

We stick to simple problems in this post for the sake of illustration, but the reason ML exists is because, in the real world, the problems are much more complex. On this flat screen we can draw you a picture of, at most, a three-dimensional data set, but ML problems commonly deal with data with millions of dimensions, and very complex predictor functions. ML solves problems that cannot be solved by numerical means alone.

With that in mind, let’s look at a simple example. Say we have the following training data, wherein company employees have rated their satisfaction on a scale of 1 to 100:

First, notice that the data is a little noisy. That is, while we can see that there is a pattern to it (i.e. employee satisfaction tends to go up as salary goes up), it does not all fit neatly on a straight line. This will always be the case with real-world data (and we absolutely want to train our machine using real-world data!). So then how can we train a machine to perfectly predict an employee’s level of satisfaction? The answer, of course, is that we can’t. The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

It is somewhat reminiscent of the famous statement by British mathematician and professor of statistics George E. P. Box that “all models are wrong, but some are useful”.

The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

ML builds heavily on statistics. For example, when we train our machine to learn, we have to give it a statistically significant random sample as training data. If the training set is not random, we run the risk of the machine learning patterns that aren’t actually there. And if the training set is too small (see law of large numbers), we won’t learn enough and may even reach inaccurate conclusions. For example, attempting to predict company-wide satisfaction patterns based on data from upper management alone would likely be error-prone.

With this understanding, let’s give our machine the data we’ve been given above and have it learn it. First we have to initialize our predictor h(x) with some reasonable values of θ0 and θ1. Now our predictor looks like this when placed over our training set:

If we ask this predictor for the satisfaction of an employee making $60k, it would predict a rating of 27:

It’s obvious that this was a terrible guess and that this machine doesn’t know very much.

So now, let’s give this predictor all the salaries from our training set, and take the differences between the resulting predicted satisfaction ratings and the actual satisfaction ratings of the corresponding employees. If we perform a little mathematical wizardry (which I will describe shortly), we can calculate, with very high certainty, that values of 13.12 for θ0 and 0.61 for θ1 are going to give us a better predictor.

And if we repeat this process, say 1500 times, our predictor will end up looking like this:

At this point, if we repeat the process, we will find that θ0 and θ1 won’t change by any appreciable amount anymore, and thus we see that the system has converged. If we haven’t made any mistakes, this means we’ve found the optimal predictor. Accordingly, if we now ask the machine again for the satisfaction rating of the employee who makes $60k, it will predict a rating of roughly 60.

Now we’re getting somewhere.

A Note on Complexity

The above example is technically a simple problem of univariate linear regression, which in reality can be solved by deriving a simple normal equation and skipping this “tuning” process altogether. However, consider a predictor that looks like this:

This function takes input in four dimensions and has a variety of polynomial terms. Deriving a normal equation for this function is a significant challenge. Many modern machine learning problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients. Predicting how an organism’s genome will be expressed, or what the climate will be like in fifty years, are examples of such complex problems.

Many modern ML problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients.

Fortunately, the iterative approach taken by ML systems is much more resilient in the face of such complexity. Instead of using brute force, a machine learning system “feels its way” to the answer. For big problems, this works much better. While this doesn’t mean that ML can solve all arbitrarily complex problems (it can’t), it does make for an incredibly flexible and powerful tool.

Gradient Descent - Minimizing “Wrongness”

Let’s take a closer look at how this iterative process works. In the above example, how do we make sure θ0 and θ1 are getting better with each step, and not worse? The answer lies in our “measurement of wrongness” alluded to previously, along with a little calculus.

The wrongness measure is known as the cost function (a.k.a. loss function), J(θ). The input θ represents all of the coefficients we are using in our predictor. So in our case, θ is really the pair θ0 and θ1. J(θ0, θ1) gives us a mathematical measurement of how wrong our predictor is when it uses the given values of θ0 and θ1.

The choice of the cost function is another important piece of an ML program. In different contexts, being “wrong” can mean very different things. In our employee satisfaction example, the well-established standard is the linear least squares function:

J(θ0, θ1) = (1/2m) Σ (h(x_i) - y_i)^2,  summing over all m training examples

With least squares, the penalty for a bad guess goes up quadratically with the difference between the guess and the correct answer, so it acts as a very “strict” measurement of wrongness. The cost function computes an average penalty over all of the training examples.

So now we see that our goal is to find θ0 and θ1 for our predictor h(x) such that our cost function J(θ0, θ1) is as small as possible. We call on the power of calculus to accomplish this.

Consider the following plot of a cost function for some particular ML problem:

Here we can see the cost associated with different values of θ0 and θ1. We can see the graph has a slight bowl to its shape. The bottom of the bowl represents the lowest cost our predictor can give us based on the given training data. The goal is to “roll down the hill”, and find θ0 and θ1 corresponding to this point.

This is where calculus comes in to this machine learning tutorial. For the sake of keeping this explanation manageable, I won’t write out the equations here, but essentially what we do is take the gradient of J(θ0, θ1), which is the pair of derivatives of J (one over θ0 and one over θ1). The gradient will be different for every different value of θ0 and θ1, and tells us what the “slope of the hill” is and, in particular, “which way is down”, for these particular θs. For example, when we plug our current values of θ into the gradient, it may tell us that adding a little to θ0 and subtracting a little from θ1 will take us in the direction of the cost function-valley floor. Therefore, we add a little to θ0, and subtract a little from θ1, and voilà! We have completed one round of our learning algorithm. Our updated predictor, h(x) = θ0 + θ1·x, will return better predictions than before. Our machine is now a little bit smarter.

This process of alternating between calculating the current gradient and updating the θs from the results is known as gradient descent.
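To make the loop concrete, here is a tiny R sketch of gradient descent for a one-feature toy version of the salary example. The data, learning rate, and iteration count are arbitrary choices for illustration, not the article's actual numbers.

set.seed(1)
# Toy training data: salary in $10,000s vs. a noisy satisfaction rating (0-100)
x <- runif(100, 2, 12)
y <- pmin(100, pmax(0, 10 + 6 * x + rnorm(100, sd = 8)))

theta0 <- 0; theta1 <- 0      # initial guesses
alpha  <- 0.02                # learning rate (step size)
m      <- length(x)

for (i in 1:1500) {
  h      <- theta0 + theta1 * x                    # current predictions
  error  <- h - y
  theta0 <- theta0 - alpha * sum(error) / m        # step downhill along dJ/d(theta0)
  theta1 <- theta1 - alpha * sum(error * x) / m    # and along dJ/d(theta1)
}

c(theta0, theta1)       # roughly recovers the underlying 10 + 6x relationship
theta0 + theta1 * 6     # predicted satisfaction at a $60k salary (x is in $10,000s)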



That covers the basic theory underlying the majority of supervised Machine Learning systems. But the basic concepts can be applied in a variety of different ways, depending on the problem at hand.

Classification Problems

Under supervised ML, two major subcategories are:

  • Regression machine learning systems: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.

  • Classification machine learning systems: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this cookie meet our quality standards?”, and so on.

As it turns out, the underlying theory is more or less the same. The major differences are the design of the predictor h(x) and the design of the cost function J(θ).

Our examples so far have focused on regression problems, so let’s now also take a look at a classification example.

Here are the results of a cookie quality testing study, where the training examples have all been labeled as either “good cookie” (y = 1) in blue or “bad cookie” (y = 0) in red.

In classification, a regression predictor is not very useful. What we usually want is a predictor that makes a guess somewhere between 0 and 1. In a cookie quality classifier, a prediction of 1 would represent a very confident guess that the cookie is perfect and utterly mouthwatering. A prediction of 0 represents high confidence that the cookie is an embarrassment to the cookie industry. Values falling within this range represent less confidence, so we might design our system such that prediction of 0.6 means “Man, that’s a tough call, but I’m gonna go with yes, you can sell that cookie,” while a value exactly in the middle, at 0.5, might represent complete uncertainty. This isn’t always how confidence is distributed in a classifier but it’s a very common design and works for purposes of our illustration.

It turns out there’s a nice function that captures this behavior well. It’s called the sigmoid function, g(z) = 1 / (1 + e^(-z)), whose graph is an S-shaped curve squashed between 0 and 1.

Here, z is some representation of our inputs and coefficients, such as z = θ0 + θ1·x, so that our predictor becomes h(x) = g(θ0 + θ1·x).

Notice that the sigmoid function transforms our output into the range between 0 and 1.

The logic behind the design of the cost function is also different in classification. Again we ask “what does it mean for a guess to be wrong?” and this time a very good rule of thumb is that if the correct guess was 0 and we guessed 1, then we were completely and utterly wrong, and vice-versa. Since you can’t be more wrong than absolutely wrong, the penalty in this case is enormous. Alternatively if the correct guess was 0 and we guessed 0, our cost function should not add any cost for each time this happens. If the guess was right, but we weren’t completely confident (e.g. y = 1, but h(x) = 0.8), this should come with a small cost, and if our guess was wrong but we weren’t completely confident (e.g. y = 1 but h(x) = 0.3), this should come with some significant cost, but not as much as if we were completely wrong.

This behavior is captured by the log function, such that:

cost(h(x), y) = -log(h(x))       if y = 1
cost(h(x), y) = -log(1 - h(x))   if y = 0

Again, the cost function gives us the average cost over all of our training examples.
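As a tiny illustration (a sketch, not the article's code), the sigmoid predictor and the per-example log cost can be written in a few lines of R:

sigmoid <- function(z) 1 / (1 + exp(-z))

# Predictor for the cookie example: h(x) = g(theta0 + theta1 * x)
h <- function(x, theta0, theta1) sigmoid(theta0 + theta1 * x)

# Per-example cost: -log(h) when y = 1, -log(1 - h) when y = 0
logistic_cost <- function(pred, y) -(y * log(pred) + (1 - y) * log(1 - pred))

logistic_cost(0.8, 1)     # ~0.22: right and fairly confident, small cost
logistic_cost(0.3, 1)     # ~1.20: leaning the wrong way, significant cost
logistic_cost(1e-6, 1)    # ~13.8: confidently wrong, enormous cost

The overall cost J is then just the mean of these per-example costs across the training set.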

So here we’ve described how the predictor h(x) and the cost function J(θ) differ between regression and classification, but gradient descent still works fine.

A classification predictor can be visualized by drawing the boundary line; i.e., the barrier where the prediction changes from a “yes” (a prediction greater than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system, our cookie data can generate a classification boundary that looks like this:

Now that’s a machine that knows a thing or two about cookies!

An Introduction to Neural Networks

No discussion of ML would be complete without at least mentioning neural networks. Not only do neural nets offer an extremely powerful tool to solve very tough problems, but they also offer fascinating hints at the workings of our own brains, and intriguing possibilities for one day creating truly intelligent machines.

Neural networks are well suited to machine learning problems where the number of inputs is gigantic. The computational cost of handling such a problem is just too overwhelming for the types of systems we’ve discussed above. As it turns out, however, neural networks can be effectively tuned using techniques that are strikingly similar to gradient descent in principle.

A thorough discussion of neural networks is beyond the scope of this tutorial, but I recommend checking out our previous post on the subject.

Unsupervised Machine Learning

Unsupervised learning is typically tasked with finding relationships within data. There are no training examples used in this process. Instead, the system is given a set of data and tasked with finding patterns and correlations therein. A good example is identifying close-knit groups of friends in social network data.

The algorithms used to do this are very different from those used for supervised learning, and the topic merits its own post. However, for something to chew on in the meantime, take a look at clustering algorithms such as k-means, and also look into dimensionality reduction systems such as principal component analysis. Our prior post on big data discusses a number of these topics in more detail as well.

Conclusion

We’ve covered much of the basic theory underlying the field of Machine Learning here, but of course, we have only barely scratched the surface.

Keep in mind that to really apply the theories contained in this introduction to real-life machine learning examples, a much deeper understanding of the topics discussed herein is necessary. There are many subtleties and pitfalls in ML, and many ways to be led astray by what appears to be a perfectly well-tuned thinking machine. Almost every part of the basic theory can be played with and altered endlessly, and the results are often fascinating. Many variations grow into whole new fields of study that are better suited to particular problems.

Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing problems, as well as open up whole new worlds of opportunity. The demand for ML engineers is only going to continue to grow, offering incredible chances to be a part of something big. I hope you will consider getting in on the action!

This article was originally published in Toptal.

Read more…
