China released its first industry-standard big data benchmark suite

Recently, the Chinese Academy of Sciences and the China Academy of Telecommunication Research, together with a number of industry partners including Huawei, Microsoft (China), IBM CDL, Intel (China), Baidu, China Mobile, Sina, ZTE, and INSPUR, released China’s first industry-standard big data benchmark suite, BigDataBench-DCA. The specifications have been submitted to, and are under review by, China’s Ministry of Industry and Information Technology.

The specifications and source code are publicly available.

BigDataBench-DCA includes six real-world data sets (unstructured text, semi-structured text, unstructured graph data, and structured and semi-structured table data), their corresponding scalable data generation tools, and ten I/O-intensive, CPU-intensive, or hybrid workloads.

BigDataBench-DCA is a subset of BigDataBench, an open-source big data benchmark suite.
The current version, BigDataBench 3.1, models five important big data application domains: search engine, social networks, e-commerce, multimedia analytics, and bioinformatics. In specifying representative big data workloads, BigDataBench focuses on units of computation that appear frequently in OLTP, Cloud “OLTP”, OLAP, and interactive and offline analytics in each application domain.
Meanwhile, it considers a variety of data models with different types and semantics. BigDataBench also provides an end-to-end application benchmarking framework that allows the creation of flexible benchmarking scenarios by abstracting data operations and workload patterns; this framework can be extended to other application domains.

For the same big data benchmark specifications, BigDataBench provides different implementations: for example, the offline analytics workloads are implemented in MapReduce, MPI, Spark, and DataMPI, while the interactive analytics and OLAP workloads use Shark, Impala, and Hive. In addition to real-world data sets, BigDataBench also provides a suite of parallel big data generation tools, BDGS, which generates scalable big data (e.g., at PB scale) from small or medium-scale real-world data while preserving the original data characteristics.

To model and reproduce multi-application or multi-user scenarios on clouds or in datacenters, BigDataBench provides a multi-tenancy version, which supports flexible configuration and replay of mixed workloads according to real workload traces (the Facebook, Google, and Sogou traces). For systems research (e.g., architecture, OS, networking, and storage), the number of benchmarks is multiplied by the number of implementations and hence becomes massive.
To reduce research and benchmarking cost, a small number of representative benchmarks, called the BigDataBench subset, is selected according to workload characteristics from a specific perspective.
For example, for the architecture community, where simulation-based research is very time-consuming, the BigDataBench architecture subset is provided on the MARSSx86, gem5, and Simics simulators.


Google X’s next moonshot is to conquer the human genome


The latest moonshot project of the Google X secret laboratory revolves around the collection of medical data. According to the Wall Street Journal, the mysterious division’s experiment, called the Baseline Study, involves the design of an ambitious database in which genetic information will be mapped so that the company can draw up a picture of what the healthiest possible human body would look like.

The first stage of the project will involve harvesting genetic and molecular data from 175 people in order to create the database, which will have much larger and broader datasets than any other study of its kind. The hope is that it will become a tool to help physicians detect and treat major health issues.

As such, the study won’t place any emphasis on tackling specific diseases, but will collect many hundreds of samples using all manner of diagnostic tools. Once the data is accumulated, Google will be able to scan through it and let its computers discover patterns that will serve as biomarkers for disease discovery.

Rather than focusing on finding cures for various diseases, the project will work entirely on developing preventative medicines, techniques and technologies, including diagnostic tools that work better at an earlier stage. It is conceivable, for example, that one biomarker could be associated with an inability to break down fatty foods or an ability to resist heart disease.

Andrew Conrad, who works for Google’s research team, told the WSJ that people shouldn’t expect immediate cures to complex diseases, but that he hopes advances will be made in “little increments”. Ultimately the project is a huge gamble, and there is no guarantee that it will result in the researchers discovering biomarkers that tell them anything major, or even anything at all.

Information that is collected will include participants’ full genomes and entire genetic histories. Google has promised that all data collected as part of the research will remain private and will not be handed over to insurance companies. The medical school boards at Duke and Stanford Universities, which are involved in the project, will oversee the research, recruit the volunteers, and ensure that their data is anonymised before it is handed over to Google.


[daily graph news] Google’s open source graph database

Google Releases Cayley Open-Source Graph Database

Cayley will be used to help Google continue to refine the idea of linking data together in graph databases, including Google’s Knowledge Graph.

Google has been using, improving and boosting its Knowledge Graph search services for several years to show users how information can be linked together in graphical form to help find desired results. Now it is again pushing forward in the graph database world through the open-source release of Cayley, which will be used in the continuing development of graph databases.


The availability of Cayley was announced by Google software engineer Barak Michener in a June 25 post on the Google Open Source Blog. “Four years ago this July, Google acquired Metaweb, bringing Freebase and linked open data to Google,” he wrote. “It’s been astounding to watch the growth of the Knowledge Graph and how it has improved Google search to delight users every day.”


Since then, the concepts of Freebase and its linked data have spread through Google’s worldwide offices, wrote Michener. “I began to wonder how the concepts would advance if developers everywhere could work with similar tools. However, there wasn’t a graph available that was fast, free, and easy to get started working with. With the Freebase data already public and universally accessible, it was time to make it useful, and that meant writing some code as a side project.”


Google is making that happen now with the release of Cayley, an open-source graph database that is being called a “spiritual successor” to graphd, wrote Michener. Cayley “shares a similar query strategy for speed” with graphd, while adding its own unique features, including a RESTful API, multiple (modular) back-end stores such as LevelDB and MongoDB, multiple (modular) query languages, and ease-of-use features that make it convenient for developers to work with, he wrote.



“Cayley is written in Go, which was a natural choice,” he added. “As a backend service that depends upon speed and concurrent access, Go seemed like a good fit. Go did not disappoint; with a fantastic standard library and easy access to open source libraries from the community, the necessary building blocks were already there. Combined with Go’s effective concurrency patterns compared to C, creating a performance-competitive successor to graphd became a reality.”


To illustrate some of the uses of Cayley, Google developers created a YouTube video that describes the building of a small knowledge graph using the application. “The video includes a quick introduction to graph stores as well as an example of processing Freebase and linked data,” he wrote.


Interested developers can also check out a demo dataset in a live instance running on Google App Engine to see how it works. “It’s running with the sample dataset in the repository — 30,000 movies and their actors, roles, and directors using Freebase film schema,” wrote Michener. “For a more-than-trivial query, try running the following code, both as a query and as a visualization; what you’ll see is the neighborhood of the given actor and how the actors who co-star with that actor interact with each other.”
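The Cayley query Michener refers to is not reproduced in this post (Cayley’s own queries are written in a Gremlin-inspired JavaScript dialect). As a conceptual stand-in, the “co-star neighborhood” idea can be sketched in a few lines of plain Python over an in-memory graph; the film and actor names here are invented, not from the Freebase sample dataset.

```python
# Conceptual sketch of the "actor neighborhood" query described above:
# films and actors are nodes, and an edge connects a film to each actor
# appearing in it. Adjacency is stored as plain sets.
graph = {
    "Film A": {"Actor X", "Actor Y"},
    "Film B": {"Actor X", "Actor Z"},
    "Actor X": {"Film A", "Film B"},
    "Actor Y": {"Film A"},
    "Actor Z": {"Film B"},
}

def costars(g, actor):
    """Actors who share at least one film with `actor`."""
    out = set()
    for film in g[actor]:   # films the actor appears in
        out |= g[film]      # everyone in those films
    out.discard(actor)
    return out

print(sorted(costars(graph, "Actor X")))  # ['Actor Y', 'Actor Z']
```

A real Cayley query expresses the same traversal declaratively and lets the store execute it over millions of triples.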


The open-source project is hosted on GitHub.


Graph search, an open-source database project built on all the networking we do online every day, is the most far-reaching search technology to go mainstream since Google started indexing and ranking Websites more than a decade ago, according to an earlier eWEEK report. Basically, a graph search database anonymously uses all the contacts in all the networks in which you participate to help you find information. Anything you touch, any service you use and anything people in your networks touch can eventually help speed information back to you, while anything non-relevant that would slow down the search is avoided.


Google is a large user and producer of open-source software.


In December 2013, Google joined the Open Invention Network (OIN), which was created in 2005 as an intellectual-property company that works to promote, protect and openly share Linux patents among its members and the open-source community. The OIN is a consortium of open-source user companies. The other members of the OIN are IBM, NEC, Philips, Red Hat, Sony and SUSE, a business unit of Novell. Canonical and TomTom are associate members of the group. Google had previously been involved with the OIN since 2007 as an “end-user licensee,” according to the OIN.


Facebook has also been experimenting with real-time graph search, which enables users to quickly find content they have touched at some point in their Facebook lifetimes, according to an eWEEK report in January 2013. Queries written in the blue bar across the top of the Graph Search page can fetch photos, videos, links, documents—anything the user has touched or shared, or had shared with—on Facebook from the first day the user joined the social network.


[daily graph news] 50 Shades of Graph: How Graph Databases Are Transforming Online Dating [from Forbes]

[Yet Another example of] online dating using graph databases.


When it comes to dating, everybody is highly motivated. So it is no surprise that the nerdy among us put their advanced knowledge to work when seeking out a mate. The most recent celebrated example is Chris McKinlay, who used a statistical modeling approach to find which type of women to go after. The result: after 88 dates, McKinlay found the right woman for him, who, as it turns out, had been hacking her profile in a different way (see “How a Math Genius Hacked OkCupid to Find True Love”).

But interest in applying technology to find love is also highlighting a shift toward graph database technology that is starting to transform applications in a large number of industries. Here is the evidence:

  • Several of the largest dating sites in the world have shifted toward graph databases in the last nine months.
  • LinkedIn has a large team working on a proprietary graph database, which sits at the center of nearly every operation at LinkedIn.
  • Twitter depends on a graph database, and has released FlockDB, a graph database it created, as open source.
  • Neo Technology, the creator of Neo4j, the most popular graph database, has now seen more than 30 Global 2000 companies adopt its technology, including enterprise brands like Wal-Mart, eBay, Lufthansa, and Deutsche Telekom.
  • Teradata just released a new type of SQL called SQL-GR, intended to make graph analytics easy for enterprise users.
  • According to a report by industry observer DB-Engines, “Graph DBMSs are gaining in popularity faster than any other database category,” growing 300 percent since January of last year.

It seemed appropriate to use Valentine’s Day and online dating as an opportunity to explore why graph databases are increasingly powering the search for love, as well as what the lessons are for other sorts of applications.

It’s the Relationships, Stupid!

Social graphs are becoming more and more crucial to online dating, as dating companies discover how much more accurate their recommendations become when network effects are considered.

Snap Interactive, the company behind the dating site AYI (Are You Interested?), uses a one-billion-person social graph to significantly improve the likelihood of finding a match. It does this by using the graph to recommend people in one’s extended social network: friends-of-friends and friends-of-friends-of-friends, who, statistically speaking, are much more likely to go out on a date than complete strangers. In just the last six months, more than half a dozen online dating companies around the world have quietly adopted graph databases to help them bring the power of the network into their decision-making. Key graphs include not just the social graph, but also the passion graph (of shared interests), the location graph, and others.

Glassdoor, which is for careers and jobs what Yelp is for food, accomplishes much the same thing, but with companies, jobs, and job seekers, also with a graph of nearly a billion people, consisting of its users and their friends. Both Snap and Glassdoor report they have significantly improved the accuracy of their recommendations by using a graph to navigate their connected data. By finding and making better use of networks, many different types of companies are breaking new ground with respect to intelligent real-time analytics. In his session at Strata, Emil Eifrem, CEO of Neo Technology, reported that many people, once they learn what a graph database can do, start seeing graphs absolutely everywhere.

Treating relationships (sometimes called the edges of a graph) as first-class objects is the fundamental innovation of graph databases. The database stores not only information about individual things but also the relationships between those things. This capability makes it much easier to express sophisticated questions and to get answers in a small fraction of the time a traditional database would take. The relationships in the database can express the nature of each connection (parent, child, owns, friend) and capture any number of qualitative or quantitative facts about that relationship (weighting, start and end dates, etc.).
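A relationship that is a first-class record, carrying both a type and arbitrary key/value facts, can be sketched in a few lines of Python. All names and fields below are illustrative, not any particular product’s data model.

```python
# Minimal property-graph sketch: nodes are keyed records, and each
# relationship is itself a record with a type plus its own properties.
nodes = {"alice": {"kind": "person"}, "acme": {"kind": "company"}}

edges = [
    # (source, relationship type, target, properties of the relationship)
    ("alice", "WORKS_AT", "acme", {"since": 2010, "weight": 0.9}),
    ("alice", "OWNS",     "acme", {"share": 0.05}),
]

def relationships(src, rel_type=None):
    """All outgoing relationships of `src`, optionally filtered by type."""
    return [e for e in edges
            if e[0] == src and (rel_type is None or e[1] == rel_type)]

for _, rel, dst, props in relationships("alice", "WORKS_AT"):
    print(rel, dst, props)  # WORKS_AT acme {'since': 2010, 'weight': 0.9}
```

In a graph database the same idea is native storage rather than a list scan, which is where the speed advantage over join-heavy relational queries comes from.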

Because of this, you can write queries that express constraints like:

  • Find all men who are connected within three friendship hops of my women friends, who like sailing but not bowling, and who live within 30 miles of my zip code.
  • Find all women who don’t know any of my friends within two levels, but enjoy spending time in some of the same places that I do.
  • Find all men in my friends-of-friends network who enjoy the activities most similar to mine.

Queries like the ones described above can take pages of SQL and execute slowly on relational databases. A graph database can return results in a snap, breaking existing SQL speed limits, often with just a few lines of code.
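As a hedged illustration, the first example query above (minus the geographic constraint) can be reduced to a few lines of plain Python over an invented in-memory social graph; a real graph database would express this in its query language and execute it over live data.

```python
# Friends-of-friends matching with interest filters, sketched over
# hand-built adjacency sets. All people and interests are made up.
friends = {
    "me":   {"ann", "bob"},
    "ann":  {"carl", "dave"},
    "bob":  {"dave", "erin"},
    "carl": set(), "dave": set(), "erin": set(),
}
likes = {"carl": {"sailing"}, "dave": {"sailing", "bowling"}, "erin": {"bowling"}}

def within_hops(start, hops):
    """Everyone reachable from `start` in at most `hops` friend steps."""
    seen, frontier = {start}, {start}
    for _ in range(hops):
        frontier = {f for p in frontier for f in friends[p]} - seen
        seen |= frontier
    return seen - {start}

matches = [p for p in within_hops("me", 2)
           if "sailing" in likes.get(p, set())
           and "bowling" not in likes.get(p, set())]
print(matches)  # ['carl']
```

The breadth-first expansion here is exactly the kind of hop-by-hop traversal that relational joins make expensive and graph stores make cheap.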

Signs You May Need A Graph Database

Eifrem, who has always been a massive booster of graph databases, was surprised by how quickly companies have found new uses for them in the last few years.

“When we started out, we thought that we would find acceptance in three key vertical markets: internet services and independent software vendors, financial services, and telecom companies,” said Eifrem, who is also the co-author of the O’Reilly book Graph Databases. “But it has turned out that we have found a home for our graph database in dozens of industries, across an even greater variety of use cases.”

Those use cases range from recommendations and real-time analytics, to fraud detection, impact analysis, identity & access management, portfolio management, resource optimization, product line management, and others.

In addition, the related field of graph analytics is also growing. In this model, data is stored in many repositories, not just in a graph database, and is brought together into a graph analytics engine for a particular analysis. Loosely speaking, graph databases are like OLTP databases and graph analytics engines are like OLAP systems. When looking at a graph-based technology, the first question to ask is: is it a database or an analytics engine? You can find out more about the popularity of the graph database space at DB-Engines, an industry observer that ranks databases by popularity. Right now, Neo4j dominates the space, but there are many entrants with powerful backing, such as the Apache Giraph project, which is based on Hadoop.

Eifrem said that the rise in acceptance of graph databases is based on several factors:

  • The world is connected. The value in computing these days is no longer about automating business processes (this was the dominant use case when the relational database was born). Today’s problems center around understanding the real world in all its connected and dynamic glory. The world is a graph: might as well embrace it.
  • Change happens fast. The world moves a lot faster than it did 20 years ago. Back then, it didn’t matter if it took months and years to design and write systems. Today the timescale is months. Putting connected data into a relational database is hard. And getting it out is even harder. By putting your graph data into a graph database, the modeling time and development time are both drastically reduced, and it’s much easier to change the model once the system has been built.
  • The need for speed. The best decisions are the ones made with the very latest information. Graph databases are brilliant at answering very complex and valuable questions in real time. This has been a holy grail of sorts in the analytics space, commonly referred to as real-time analytics. Older database systems often grind to a halt when trying to answer these kinds of questions. For certain types of graph-friendly questions, if you ask the question today, you might get the answer tomorrow. By then the customer has left your web site. The workaround has been to pre-calculate all of your recommendations at night and serve them up during the day. This sounds great until you realize something really important happened between last night and this moment that affects how you want to treat that person.

Eifrem said that any of the following problems may indicate that you should consider a graph database:

  • Performance problems with your relational database due to the complexity of your queries or data structures. According to Eifrem, “our customers have often reported a performance increase of 1,000x or more over Oracle and MySQL for certain queries.”
  • Projects taking a very long time. “This can be a side-effect of trying to make graphy data fit into tables: you end up with very long and complicated queries and code. Because it’s convoluted and hard to understand, it takes a long time to write and test, not to mention tune. This happened to me when I was CTO of a startup back in 2000, and is what led me to design the property graph model (literally on a napkin).”
  • If you feel like your business is being trapped by your data, and there are questions you want to be able to ask that just can’t be answered, then you might want to see if a graph database can unlock your hidden graph.

“Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code,” said Volker Pacher, Senior Developer at eBay, who has been using Neo4j for the last year. “At the same time, Neo4j allowed us to add functionality that was previously not possible.”

Graph databases seem to be well on their way to following the trajectory of on-line dating, which started as an oddity, then became increasingly successful, and now is accepted as a great way for many people to find a mate. If you seem to have a graphy problem, you may want to set up a date between your data and a graph database sometime soon.

Follow Dan Woods on Twitter: @danwoodsearly

Dan Woods is CTO and editor of CITO Research, a publication where early adopters find technology that matters. Dan has done research for Teradata.

[daily graph news] With Google Glass, Demand for Graph Databases Increases

I ran into the following post from the Neo4j blogs. Now think about a world of wearable devices and computers that communicate with each other, providing data streams of various types of information in highly dynamic networks of moving objects.

TechCrunch, in an article on Apigee and predictive analytics technology, takes note of the increasing demand for graph databases.

…there are the increasing amounts of data that people and machines create. With that scaling of data comes a growing demand for new types of analytics capabilities. Graph databases are becoming more popular for the varied amounts of data they aggregate and analyze. These graph databases organize nodes, which might be things like a street light or a person. Properties describe the nodes. A graph database also has “edges” that connect the nodes and properties, defining the relationships between them. The value is derived when analyzing the patterns between the nodes and the properties.

As sensors become more widely used in wearables such as Google Glass, the demand for graph databases will increase. It will be important to correlate the data from any number of sensors that might be in a house, a car or a city street. There will also be the need to analyze increasing amounts of text from medical records, contracts, etc.

Breaking down the walls

When chatting with researchers, I always hear “database guys” or “data mining guys” or “systems guys”. While tagging someone as an “X guy” helps identify what a researcher is doing and what areas, papers, and conferences he may be active in, it also reduces the possibility of collaboration and inspiration. We need channels that encourage multidisciplinary work and refuse closed-world research. Stop building a high wall of jargon, and stop setting a guard at the door who will only let in those who speak the local language. One change that comes to mind is to start a crowd-sourced platform connecting researchers from different areas, debating the same issue from different perspectives. I believe this would lead to better solutions and effective collaborations for many real-world problems. If Stack Overflow and MathOverflow work, there is no good reason or excuse for such a platform for professional researchers to fail.

An article by Phil Bernstein, “Systems & Databases: Let’s Break Down the Walls”, suggests changes to the major database conferences to bring them closer to the systems conferences.
Making such connections can always introduce surprises in a good way.

big graph startups

Graph databases use graph structures (nodes, edges and properties) for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbour elements, so no index lookups are necessary. Graph databases are faster than relational databases for associative data sets, and because they do not need join operations, they can scale naturally to large data sets.
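Index-free adjacency can be sketched in a few lines: each node holds direct references to its neighbours, so a traversal follows pointers instead of performing a join or index lookup per hop. This is a conceptual sketch, not any specific product’s storage layout.

```python
# "Index-free adjacency" in miniature: nodes carry direct pointers to
# their neighbours, so walking the graph is pure pointer-chasing.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []   # direct references: the index-free part

a, b, c = Node("a"), Node("b"), Node("c")
a.neighbours.append(b)
b.neighbours.append(c)

def reachable(start):
    """Depth-first walk following pointers only; no lookups by key."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n.name not in seen:
            seen.add(n.name)
            stack.extend(n.neighbours)
    return seen

print(sorted(reachable(a)))  # ['a', 'b', 'c']
```

A relational model would instead store the edges in a table and resolve each hop with a join against that table, which is where the per-hop cost difference comes from.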


Gephi helps people understand and explore graphs and patterns. It uses a 3D engine to display graphs in real time, which can help users form hypotheses and isolate structural singularities or faults during data sourcing. It is written in Java on the NetBeans platform and can be used to analyse graphs extracted from OrientDB.


FlockDB is a simple graph database intended for online, low-latency, high-throughput environments such as websites. FlockDB is used by Twitter to store social graphs. It is a distributed graph database and can support complex set-arithmetic queries. The database is licensed under the Apache License.


GraphBuilder can reveal hidden structures in big data by constructing graphs out of large data sets. Developed by Intel and built in Java, it uses Hadoop and scales using the MapReduce parallel processing model. The GraphBuilder library takes care of many of the difficulties of graph construction, such as graph transformation, formation and compression.


InfoGrid is developed in Java, and at its heart lies the GraphDatabase. It offers many additional software components that make it easy to develop graph-based web applications. InfoGrid is sponsored by NetMesh, which also offers commercial support for using InfoGrid.


Meronymy is a SPARQL database server that offers, among other features, named graphs. It is written in C++ and is a NoSQL database management system. The high-performance, cross-platform Resource Description Framework (RDF) store is especially tailored for big data. It is usable with many programming languages and is expected to enter beta in 2013.


InfiniteGraph helps users ask more complex and deeper questions across their data stores. It can work with massive amounts of distributed data, and projects that need more than one server will benefit most from this graph database. It offers high-speed graph traversals, scalability and parallel consumption of data.

AllegroGraph 4.9

AllegroGraph is a graph database with MongoDB integration. It is designed for high-performance loading and query speed. It uses memory efficiently and can scale to massive numbers of quads. It supports SPARQL and RDFS++ and has a JavaScript-based interface.


Gremlin is a graph traversal language that can be used for graph analysis, querying and manipulation. Gremlin works with graph databases that implement the Blueprints property graph data model, including, among others, Neo4j, OrientDB and InfiniteGraph. Gremlin provides native support for Java and Groovy.


HyperGraphDB is a general-purpose data storage mechanism designed for knowledge representation. It is based on directed hypergraphs and offers graph-oriented storage. It can be used as an embedded object-oriented database for Java projects or as a NoSQL database. The core of the database engine is designed for generalized, typed and directed hypergraphs.


GraphBase is a graph database management system built from scratch to manage large graphs. It makes huge, richly structured data stores possible. GraphBase simplifies working with graph-structured data, instead of wrestling with very complex, spaghetti-like structures. With GraphBase Singleview, it becomes possible to turn a database into a single, searchable and navigable graph.


BrightstarDB Mobile and Embedded are the open-source editions of BrightstarDB, a NoSQL database designed for the .NET platform that is fast, embeddable and scalable. It does not need a fixed schema, which gives it a lot of flexibility in what data is stored and how. Its associative data model fits well with real-world applications.

[daily graph news] Intel Adds Graph Builder to Big Data Tools

Intel adds graph and other big data software updates, Trinity Pharma raises $15 million to grow its health care analytics offering, and Splice Machine and MapR collaborate on real-time SQL-on-Hadoop databases.

Intel updates big data tools. Intel (INTC) announced several updates to its data center software products that provide enhanced security and performance for big data management, as well as a suite of tools that simplify deployment of machine learning algorithms and advanced analytics, including graph analysis. The announcements include the release of Intel Graph Builder for Apache Hadoop software v2.0, Intel Distribution for Apache Hadoop software 3.0, Intel Analytics Toolkit for Apache Hadoop software and the Intel Expressway Tokenization Broker. Intel Graph Builder for Apache Hadoop software v2.0 is a set of pre-built libraries that enable high-performance, automated construction of rich graph representations. Intel Distribution for Apache Hadoop software 3.0 includes a number of security enhancements to the second generation of the Apache Hadoop architecture recently released by the open source community. The Intel Distribution for Apache Hadoop software 3.0 includes support for Apache Hadoop 2.x and YARN with major upgrades to MapReduce, HDFS, Hive, HBase, and related components. “Some of the leading data-driven companies have invested heavily to create and implement their own big data analytics solutions,” said Boyd Davis, vice president and general manager of Intel’s Datacenter Software Division. “Intel is bringing this capability to market by providing software that is more secure and easier to use so that companies of all sizes can more easily uncover actionable insight in their data.”

Trinity Pharma raises $15 million in growth capital. Big data health care analytics company Trinity Pharma Solutions announced that it has raised $15 million in growth and expansion funding. The investment will enable Trinity to extend its solutions in response to strong demand for its cloud-based, big data healthcare analytics.  “Over the last 12 years, Trinity has built a proven business and technology that is solving the increasingly complex challenges that life sciences and healthcare companies face worldwide. The rapid change in healthcare requires real-time analytics to provide value beyond counting pills and with the potential to improve patient outcomes,” said David Tamburri, HEP General Partner. “With a seasoned management team, led by Co-Founder and CEO Zackary King, we see tremendous opportunity to accelerate Trinity’s growth in support of demand for its cloud-based software.”  Trinity expects to invest in the areas of sales, marketing, and technology to better serve its rapidly expanding customer base. The company also plans to double its employee base and expand its geographic reach with offices in New Jersey and California.

Splice Machine and MapR partner for Hadoop database.  Real-time transactional SQL-on-Hadoop database provider Splice Machine announced a partnership with MapR Technologies. The partnership brings Splice Machine to the MapR enterprise Hadoop platform, enabling companies to use the MapR Distribution for Hadoop to build their real-time SQL-on-Hadoop applications. Splice Machine enables MapR Distribution users to tap into real-time updates with transactional integrity, an important feature for companies looking to become real-time, data-driven businesses. “This partnership is another step in the progression of Hadoop, from a highly scalable data store, to a real-time, high-performance platform for operational and analytical applications,” said Bill Bonin, VP of business development of MapR Technologies. “Now, companies have the ability to combine our enterprise-grade Hadoop distribution with Splice Machine to build real-time, transactional applications that are also dependable, scalable and secure.”