Databases
As you become more ambitious in the types and quantities of data you use, you’ll need to look beyond traditional methods of storing and retrieving data. Recent innovations in non-traditional databases for diverse data types form a fundamental component of the big data ecosystem. These databases are often referred to as noSQL databases, a name usually expanded as ‘not only SQL’, since data can be retrieved from them in ways beyond the Structured Query Language (SQL). A key feature typifying these new databases is that the structure (the schema) can be defined on the fly, so we call them ‘schema-less databases’ and talk about ‘schema-on-read’. They are typically designed for efficient horizontal scaling, so we can grow their capacity by adding more machines rather than by buying more expensive ones.
You’ll need to choose the traditional and non-traditional databases that are most helpful for your applications. I’ll now briefly review some of the primary types of databases within the big data ecosystem. To give an idea of the extent of industry activity around each database type, I’ll add in parentheses the category ranking scores from db-engines.com as at July 2017.68
Relational databases (80 per cent)
These have been the standard databases for operational use for the past 30–40 years. They sit within a relational database management system (RDBMS) and consist of individual tables containing data rows with pre-determined columns, such as first name, last name, customer ID, phone number, etc. The tables are related when they share columns with the same information. For example, if there is a customer ID column in both the customer details table and the sales table, then you can join the two tables on that column and compute sales grouped by customer postal code. The same relational database can be designed for operational use or designed for use in analytics and reporting (as a data warehouse).
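The cross-referencing described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names, column names and values are hypothetical, and an in-memory SQLite database stands in for whichever relational database you actually use.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_details (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT,
        postal_code TEXT
    );
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer_details(customer_id),
        amount      REAL
    );
""")
conn.executemany(
    "INSERT INTO customer_details VALUES (?, ?, ?)",
    [(1, "Smith", "1011"), (2, "Jones", "1011"), (3, "Lee", "2022")],
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (2, 30.0), (3, 5.0), (1, 10.0)],
)


def sales_by_postal_code(connection):
    """Join the two tables on the shared customer_id column and total sales per postal code."""
    return connection.execute("""
        SELECT c.postal_code, SUM(s.amount)
        FROM sales AS s
        JOIN customer_details AS c ON c.customer_id = s.customer_id
        GROUP BY c.postal_code
        ORDER BY c.postal_code
    """).fetchall()


print(sales_by_postal_code(conn))  # [('1011', 60.0), ('2022', 5.0)]
```

The postal code lives only in the customer details table and the amounts only in the sales table; the shared customer ID column is what lets one query combine them.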
Document-oriented databases (7 per cent)
These are designed for large-scale storage and retrieval of documents, typically containing data stored in flexible XML or JSON formats. The most commonly used document-oriented database is MongoDB, which is open-source.
Document-oriented databases serve well as gateway noSQL solutions, since they can quickly provide general functionality.
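To make the idea of flexible documents concrete, here is a small sketch that uses only Python's json module and an in-memory dictionary. It is not the API of MongoDB or any other product; the `find` helper and the sample documents are hypothetical, and the point is only that records in one collection need not share the same fields.

```python
import json

# Hypothetical product documents: the two records do not share the same fields.
raw_documents = [
    '{"_id": 1, "type": "book", "title": "Big Data", "authors": ["A. Writer"]}',
    '{"_id": 2, "type": "shirt", "title": "Logo tee", "sizes": ["S", "M"], "colour": "blue"}',
]

# The "collection" is just a mapping from document ID to the parsed document.
collection = {}
for raw in raw_documents:
    doc = json.loads(raw)
    collection[doc["_id"]] = doc


def find(documents, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks a field simply does not match; no schema is needed.
    """
    return [
        doc for doc in documents.values()
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


print([doc["title"] for doc in find(collection, type="shirt")])  # ['Logo tee']
```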
Search engine databases (4 per cent)
These are used to power onsite search on many websites, returning search results over vast amounts of inventory using customizable logic to match results to user search queries. With such fundamental functionality, they are often the first foray of websites into the big data ecosystem and are designed to address both the velocity and the variety challenges of big data, particularly for search. These databases are sometimes used for general data storage and analysis, although care should be taken here. Some of the most commonly used search engine databases are Elasticsearch, Solr and Splunk.
Key-value stores (3 per cent)
Entries in these databases are simply key-value pairs. They can return large numbers of simple lookups very quickly, which is particularly useful for online, customer-facing applications. Key-value stores address the velocity challenge of big data.
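The access pattern of a key-value store can be sketched in a few lines. The `KeyValueStore` class, the key format and the session data below are hypothetical, and a Python dictionary stands in for a real distributed store; only the get-by-key and put-by-key pattern is the point.

```python
class KeyValueStore:
    """A toy in-memory key-value store: values are opaque and fetched only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


store = KeyValueStore()
# Hypothetical use: look up a web session by its ID while serving a page.
store.put("session:42", {"customer_id": 1001, "basket_items": 3})

print(store.get("session:42"))
print(store.get("session:99"))  # missing keys return the default, None
```

There is no query over the values themselves, which is part of why such lookups can be made very fast.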
Wide column stores (3 per cent)
Similar to relational databases in functionality, but providing the flexibility to add data fields on the fly, wide column stores address the variety challenge of big data. For example, a relational database might have 20 pre-defined customer data columns, whereas a wide column store would allow on-the-fly creation of any column type for any customer. If you started a new initiative after several years, such as a premium membership class, you could simply add the required
additional columns, such as membership number or total membership points, to a selection of customer records. The data rows for non-members would not change.
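The premium membership example can be sketched as follows. This is a simplified in-memory model, not the API of any wide column store; the row keys, column names and the `add_columns` helper are hypothetical.

```python
from collections import defaultdict

# Row key -> {column name -> value}. Each row stores only the columns it actually has.
customers = defaultdict(dict)
customers["cust-001"].update({"first_name": "Ana", "postal_code": "1011"})
customers["cust-002"].update({"first_name": "Ben", "postal_code": "2022"})


def add_columns(table, row_key, **columns):
    """Attach new columns to a single row; other rows are left untouched."""
    table[row_key].update(columns)


# Years later, a premium membership class is introduced for some customers.
add_columns(customers, "cust-001", membership_number="M-778", membership_points=120)

print(sorted(customers["cust-001"]))
# ['first_name', 'membership_number', 'membership_points', 'postal_code']
print(sorted(customers["cust-002"]))
# ['first_name', 'postal_code']
```

The non-member row carries no empty membership fields, which is the contrast with adding columns to a fixed relational table.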
Graph databases (1 per cent)
These databases store data in the structure of a graph (a network of nodes and edges). They allow you to query data based on attributes and relationships. For example, you could easily find all of a customer’s third-degree connections with
a given postal code and membership status. Graph databases take advantage of sparsity and structural features to enable very fast execution of queries that would involve tragically slow multiple inner joins on a traditional relational database. In Chapter 6, we saw an example of using a graph database to de-duplicate
customers.
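The third-degree connection query can be sketched with a breadth-first search over an adjacency list. This is an illustration of the kind of traversal involved, not how any particular graph database is implemented; the names, edges and attributes are hypothetical.

```python
from collections import deque

# Hypothetical undirected "knows" relationships between customers.
edges = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan", "eve"},
    "dan": {"cat"},
    "eve": {"cat"},
}
attributes = {
    "ann": {"postal_code": "1011", "member": True},
    "bob": {"postal_code": "1011", "member": False},
    "cat": {"postal_code": "2022", "member": True},
    "dan": {"postal_code": "2022", "member": True},
    "eve": {"postal_code": "2022", "member": False},
}


def connections_at_degree(graph, start, degree):
    """Return the nodes whose shortest path from start has exactly `degree` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == degree:
            continue  # no need to explore beyond the requested degree
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {node for node, d in distance.items() if d == degree}


def third_degree_matches(start, postal_code, member):
    """Third-degree connections of start with the given postal code and membership status."""
    return sorted(
        node for node in connections_at_degree(edges, start, 3)
        if attributes[node]["postal_code"] == postal_code
        and attributes[node]["member"] == member
    )


print(third_degree_matches("ann", "2022", True))  # ['dan']
```

Each hop follows stored edges directly. In a relational database the same question would typically be expressed as repeated self-joins on a relationships table.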
Choosing a database
You may be overwhelmed by the several hundred databases available in the market today. When selecting a database appropriate to your use case, consider not only the type and the cost of the database, but also its place within your current technology stack, the breadth of its adoption within the industry (which impacts staffing, maintenance and future capabilities), its scalability, concurrency and the tradeoff between consistency, availability and partition tolerance (according to Brewer’s CAP theorem, proven in 2002, any distributed database can guarantee at most two of these three). Some of these factors may be critical for your application, while others may be less important.
You can find an ordered list of the currently popular databases for different
categories at db-engines.com, which also shows recent trends (see Figure 8.3). At the time of writing, time series databases have been gaining interest faster over the past 12 months than any other type of database (possibly due to their use in IoT), but they are still very much overshadowed by the other types mentioned above. The market research and advisory firms Gartner and Forrester regularly publish detailed analysis of the databases offered by many larger vendors in their publications known as Gartner Magic Quadrants and Forrester Waves.
Figure 8.3 Number of listed database systems per category, July 2017.68
Programming languages
When developing your analytic models, choose a programming language that fits within your broader IT organization, has well-developed analytics libraries, and integrates well with other data and analytic tools you are likely to use. There is no single best language for analytics, but the top two contenders in online forums are R and Python, at least for the initial stages of development.
In addition to personal preferences, check the constraints of the IT environment in which you are working and of any third-party software you might use. For example, Python is typically one of the first languages supported by open-source big data projects (as was the case for TensorFlow and Hadoop streaming), but many analysts come out of academia with extensive experience in R. Those from banking environments are typically familiar with SAS, which itself has an extensive ecosystem, including the powerful (and relatively expensive) SAS Enterprise Miner.
Some companies allow analysts to choose their own language for prototyping models, but require that any model deployed to a production environment first be coded in a compiled language such as C++ or Java and be subjected to the same rigorous testing and documentation requirements as all other production code.
Some deploy the analytic models as REST services, so that the code runs separately from other production code.
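As a rough sketch of what deploying a model as a REST service can look like, the example below wraps a stand-in model in a small HTTP endpoint using only Python's standard library. Everything specific here is an assumption made for illustration: the `/predict` path, the feature names, the linear `predict` function and the helper names. A production service would normally use a proper web framework and add validation, authentication and monitoring.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in model: a fixed linear score over two features."""
    return 0.5 * features["recency"] + 2.0 * features["frequency"]


class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body, score it, and return the result as JSON.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(features)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch


def start_server(port=0):
    """Start the service on localhost in a background thread (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_model(host, port, features):
    """Client side: any application that can send HTTP requests can use the model."""
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        payload = json.dumps(features).encode("utf-8")
        connection.request("POST", "/predict", body=payload,
                           headers={"Content-Type": "application/json"})
        return json.loads(connection.getresponse().read())
    finally:
        connection.close()
```

Calling `start_server()` and then `call_model("127.0.0.1", server.server_address[1], {"recency": 2, "frequency": 3})` returns `{"score": 7.0}`. The caller only needs the URL and the JSON format, so the model code can be written, tested and released separately from the rest of the production code.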