Guest blog post by Tony Agresta
Organizations are struggling with a fundamental challenge – there’s far more data than they can handle. Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making, but is this a reality for most companies? The big data tidal wave is transforming the database management industry, employee skill sets, and business strategy as organizations race to unlock meaningful connections between disparate sources of data.
Graph databases are rapidly gaining traction in the market as an effective method for deciphering meaning, but many people outside the space are unsure of what exactly this entails. Generally speaking, graph databases store data in a graph structure where entities are connected through relationships to adjacent elements. The Web is a graph, and so are your friend-of-a-friend network and the road network.
The fact is, we all encounter the principles of graph databases in many aspects of our everyday lives, and this familiarity will only increase. Consider just a few examples:
- Facebook, Twitter and other social networks all employ graphs for more specific, relevant search functionality. Results are ranked and presented to us to help us discover things.
- By 2020, it is predicted that the number of connected devices will reach nearly 75 billion globally. As the Internet of Things continues to grow, it is not the devices themselves that will dramatically change the ways in which we live and work, but the connections between these devices. Think healthcare, work productivity, entertainment, education and beyond.
- There are over 40,000 Google searches processed every second. That amounts to roughly 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Online search is now the default route to information discovery. As people not only perform general Google searches but also search for content within specific websites, graph databases will be instrumental in driving more relevant, comprehensive results. This is game changing for online publishers, healthcare providers, pharma companies, government and financial services, to name a few.
- Many of the most popular online dating sites leverage graph database technology to cull through the massive amounts of personal information users share to determine the best romantic matches. Why is this? Because relationships matter.
In the simplest terms, graph databases are all about relationships between data points. Think about the graphs we come across every day, whether in a business meeting or news report. Graphs are often diagrams demonstrating and defining pieces of information in terms of their relations to other pieces of information.
Traditional relational databases can easily capture the relationship between two entities, but when the goal is to capture “many-to-many” relationships between multiple points of data, queries can take a long time to execute and maintenance becomes quite challenging. For instance, suppose you wanted to search across several social networks for people who attended the same university as you AND live in San Francisco AND share at least three mutual friends with you. Graph databases can typically execute these types of queries quickly, with just a few lines of code or mouse clicks. The implications across industries are tremendous.
Graph databases are gaining in popularity for a variety of reasons. Many are schema-less, allowing you to manage your data more efficiently. Many support a powerful query language, SPARQL. Some allow for simultaneous graph search and full-text search of content stores. Some exhibit enterprise resilience, replication and highly scalable simultaneous reads and writes. And some have other very special features worthy of further discussion.
One specialized form of graph database is an RDF triplestore. This may sound like a foreign language, but at the root of these databases are concepts familiar to all of us. Consider the sentence, “Fido is a dog.” This sentence structure – subject-predicate-object – is how we speak naturally and is also how data is stored in a triplestore. Nearly all data can be expressed in this simple, atomic form. Now let’s take this one step further. Consider the sentence, “All dogs are mammals.” Many triplestores can reason just the way humans can. They can come to the conclusion that “Fido is a mammal.” What just happened? An RDF triplestore used its “reasoning engine” to infer a new fact. These new facts can be useful in providing answers to queries such as “What types of mammals exist?” In other words, the “knowledge base” was expanded with related, contextual information. With so many organizations interested in producing new information products, this process of “inference” is a very important aspect of RDF triplestores. But where do the original facts come from?
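The Fido example can be sketched in a few lines of Python. This is a toy illustration, not how any real triplestore's reasoning engine is implemented. The predicate names `is_a` and `subclass_of` and the `infer` function are assumptions made up for this sketch, loosely modelled on subclass reasoning.

```python
def infer(facts):
    """Repeatedly apply two rules until no new triple appears:
    1. (x, is_a, C) and (C, subclass_of, D)         =>  (x, is_a, D)
    2. (A, subclass_of, C) and (C, subclass_of, D)  =>  (A, subclass_of, D)
    """
    known = set(facts)
    while True:
        subclass_pairs = [(s, o) for s, p, o in known if p == "subclass_of"]
        derived = set()
        for s, p, o in known:
            if p not in ("is_a", "subclass_of"):
                continue
            for sub, sup in subclass_pairs:
                if o == sub:
                    derived.add((s, p, sup))
        if derived <= known:  # nothing new was derived, so we are done
            return known
        known |= derived


# The two explicit facts from the text, stored as subject-predicate-object.
facts = {
    ("Fido", "is_a", "dog"),
    ("dog", "subclass_of", "mammal"),
}

knowledge_base = infer(facts)

# The inferred fact was never stated explicitly.
print(("Fido", "is_a", "mammal") in knowledge_base)

# "What types of mammals exist?" becomes a simple lookup.
mammal_types = sorted(s for s, p, o in knowledge_base
                      if p == "subclass_of" and o == "mammal")
print(mammal_types)
```

The loop stops once a pass derives nothing new, which is the sense in which the knowledge base has been "expanded" by inference.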
Since documents, articles, books and e-mails all contain free flowing text, imagine a technology where the text can be analyzed with results stored inside the RDF triplestore for later use. Imagine a technology that can create the semantic triples for reuse later. The breakthrough here is profound on many levels: 1) text mining can be tightly integrated with RDF triplestores to automatically create and store useful facts and 2) RDF triplestores not only manage those facts but they also “reason” and therefore extend the knowledge base using inference.
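As a deliberately naive illustration of turning text into triples, the sketch below uses two regular expressions to pull facts out of sentences shaped like the Fido example. Real text mining pipelines are far more sophisticated. The function name `extract_triples`, the predicate names and the crude handling of plurals (stripping a trailing "s") are all assumptions made for this sketch.

```python
import re

# "Fido is a dog" -> (Fido, is_a, dog); the subject must start with a capital.
INSTANCE_PATTERN = re.compile(r"\b([A-Z]\w*) is an? (\w+)")

# "All dogs are mammals" -> (dog, subclass_of, mammal); plural "s" is dropped.
SUBCLASS_PATTERN = re.compile(r"\bAll (\w+?)s are (\w+?)s\b")


def extract_triples(text):
    """Return subject-predicate-object triples found by the two patterns."""
    triples = []
    for subject, obj in INSTANCE_PATTERN.findall(text):
        triples.append((subject, "is_a", obj))
    for subject, obj in SUBCLASS_PATTERN.findall(text):
        triples.append((subject, "subclass_of", obj))
    return triples


extracted = extract_triples("Fido is a dog. All dogs are mammals.")
for triple in extracted:
    print(triple)
```

Anything the patterns do not recognize is silently ignored, which is exactly the gap that dedicated text mining tools are meant to close.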
Why is this groundbreaking? The full set of reasons extends beyond the scope of this article but here are some of the most important:
Your unstructured content is now discoverable, allowing all types of users to quickly find the exact information for which they are searching. This is a monumental breakthrough since so much of the data that organizations stockpile today exists as dark data repositories.
We said earlier that RDF triplestores are a type of graph database. By their very nature, the triples stored inside the graph database (think “facts” in the form of subject-predicate-object) are connected. “Fido is a dog. All dogs are mammals. Mammals are warm blooded. Mammals have different body temperatures, etc…” The facts are linked. These connections can be measured. Some entities are more connected than others, just as some web pages are more connected to other web pages. Because of this, metrics can be used to rank the entries in a graph database. One of the first and best-known algorithms used at Google is PageRank, which counts the number and quality of links to a page – an important metric in assessing the importance of a web page. Similarly, facts inside a triplestore can be ranked to identify important interconnected entities, with the most connected ordered first. There are many ways to measure the entities but this is one very popular use case.
With billions of facts referencing connected entities inside a graph database, this information source can quickly become the foundation for knowledge discovery and knowledge management. Today, organizations can structure their unstructured data, add free facts from Linked Open Data sets, and combine all of this with controlled vocabularies, thesauri, taxonomies or ontologies, which, to one degree or another, are used to classify the stored entities and depict relationships. Real knowledge is then surfaced from the results of queries, visual analysis of graphs or both. Everything is indexed inside the triplestore.
Graph databases (and specialized versions called native RDF triplestores that embody reasoning power) show great promise in knowledge discovery, data management and analysis. They reveal simplicity within complexity. When combined with text mining, their value grows tremendously. As more and more connections are formed and unstructured data multiplies with fury, the ability to analyze text and structure the results inside graph databases is becoming an essential part of the database ecosystem. Today, these combined technologies are available and are not just reserved for the big search engine providers. It may be time for you to consider how to better store, manage, query and analyze your own data. Graph databases are the answer.
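For readers who want to see what ranking by connections can look like, here is a simplified PageRank-style calculation in plain Python. It is a textbook-style sketch, not Google's implementation and not the ranking method of any specific triplestore. The tiny graph, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict mapping node -> list of linked nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start with equal scores
    for _ in range(iterations):
        # Every node receives a small baseline score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's score among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its score evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: each triple's subject links to its object.
links = {
    "Fido": ["dog"],
    "Rex": ["dog"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "mammal": ["warm blooded"],
}

ranks = pagerank(links)
for entity, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{entity}: {score:.3f}")
```

Entities that many others point to, directly or indirectly, end up with higher scores than entities nothing points to, which is the intuition behind ordering the most connected entities first.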
To learn more about these approaches, see the resources section of www.ontotext.com.