Databases: global trends and Russia

This article does not try to answer every question about databases (DB) and database management systems (DBMS). I am expressing my own opinion about the trends, trying to rely on impartial indicators and statistics while drawing on my own experience for illustration. I am not a paid representative of any company; my point of view is based on more than 25 years of working with various DBMSs, including one I built with my own hands. Even among experienced programmers and architects, few know all the terms and technologies, where the pitfalls lie, and where the field is heading. The topic is truly huge, so even a top-level overview cannot fit into one article. If your favorite DBMS, or some remarkable strength of it, is not mentioned, please point it out in the comments and add to the overall picture; it will help others understand the subject area better. Let's go!

Open Source DBMS vs Commercial DBMS

To begin with, here is a graph from db-engines.com, a site that, in my opinion, does a good job of tracking database trends. It was this graph that prompted me to write an article on the current state of affairs. When we say "database", we often actually mean a specific database management system (DBMS), so if the text says DB instead of DBMS, that habit is the reason.

Open source database management systems have caught up with commercial closed-source DBMSs: 49.98% for open source versus 50.02% for commercial. The end of 2020 is the moment when we can say that open source is no less popular. As the graph shows, this situation did not arise suddenly. The chart is calculated not as a simple count of systems but in points scored by each system.

Out of interest, you can check the site to see where your favorite DBMS sits in the ranking. The result of the past year was the departure of Microsoft Access from the top ten; together with the COBOL programming language, it reminds us that the life cycle of technologies can be very long. I believe that next year IBM DB2 will fall the furthest within the top. The top 10 DBMSs account for 75% of all points scored, and over the year the points of the top 10 have hardly changed.

Open source has grown in quality in recent years, and more and more IT decision-makers are asking whether licenses for Oracle, MS SQL, IBM DB2 and other commercial products are worth the investment. The appetites of the vendors contribute to this in no small measure. In recent years it has become customary to sell enterprise licenses (enterprise-level licenses without artificial restrictions on functionality) per processor core. You can follow the link and calculate what it costs for a server with 4 processors of 16 cores each, 64 cores in total, if per-user licensing does not suit you:

MS SQL: $439,936

Oracle: $1,368,000

Moreover, the licensing system is so intricate that even the companies selling the licenses do not always understand it. This leads to situations where the vendor comes to you and claims you have not purchased some additional licenses beyond what you have already bought and budgeted for. And this is not a rare situation.

The past year 2019 and the current 2020 have seen the growing influence of AMD and its processors, including the server EPYC line, which dramatically change the cost of physical cores, meaning more cores will be bought. Multi-core AMD chips evolved with the introduction of chiplets, and 64 cores in one processor is now a reality. Will companies be ready to buy commercial DBMS licenses for many times more cores? I am not sure. A physical server now costs an order of magnitude less than the licenses for it. The server market is rather inert and will not move to AMD within a few years, but Oracle, Microsoft and others will begin to lose share even faster if they do not change their policy. Microsoft released a version of SQL Server for Linux, but again for money, and there were not many people willing to buy it. IBM DB2 has been losing market share for a long time.

The transition of many companies to microservices (you have probably heard "we are sawing the monolith into microservices" at Russian conferences and forums) often leads to data being physically separated onto different servers; for a commercial DBMS that means more paid licenses, for open source it does not. To stay within budget, companies choose free solutions.

It is worth noting that the commercial DBMS business is profitable and owned by the richest people on the planet. The top of the rich list includes owners of companies behind commercial DBMSs that occupy a significant share of the DBMS rankings: Bill Gates (Microsoft) and Larry Ellison (Oracle). Also in the Forbes top 10 are owners of huge or cloud databases: Jeff Bezos (Amazon RDS), Larry Page and Sergey Brin (Google with its Bigtable).

Recently, large IT companies with big money have been investing in open source projects. One might think they were imbued with the ideas of the open source community, filled with philanthropy, and wanted to do something useful for society. But no: the big sharks of business expect at some stage to monetize these investments, improve opinion of their companies, capture markets, take advantage of the community's free labor, and pick up ideas that have taken off. It is no small thing that thousands of bright minds commit high-quality code to a repository, test it on real projects, and do not ask for money.

MySQL, despite being open source, was sold for $1 billion to Sun Microsystems and then absorbed by Oracle. MySQL founder Michael Widenius forked the project and called it MariaDB. I attended his talk in Moscow, where he urged developers to switch to this continuation of the original DBMS, and I must say a considerable share of them did just that. MariaDB is already in 12th place.

Commercial DBMSs are expensive, but they have a lot of advantages: they are stable, convenient, easily integrated into IT infrastructure, have established specialist training programs, and third-party companies extend their functionality. Beyond being free and covering a growing share of the capabilities of paid DBMSs, open source has the advantage that you can take the database code and change it to suit your needs, as Facebook did with MySQL. One open source DBMS can have several engines or several development branches, and you can choose whichever option suits you. SQLite, ranked in the top 10, is a cross-platform product: a personal database that does not use the client-server paradigm, for when you just need to store data locally for an application with minimal overhead while keeping the advantages of a DBMS.
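To make the "serverless" point concrete, here is a minimal sketch of embedding SQLite via Python's standard library: no server process, the whole database is a single local file (or lives in memory). The table and values are invented for illustration.

```python
import sqlite3

# SQLite needs no server process: the whole database is one local file
# (here an in-memory one; pass a file path for persistence).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")
conn.commit()

# The application still gets full SQL over its local data.
row = conn.execute("SELECT value FROM settings WHERE key = 'theme'").fetchone()
print(row[0])  # dark
```

This is exactly the "store data locally with minimal costs" scenario: the application links the library and works with a file, with no client-server round trips.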

What about Russia? You need to understand that when the first commercial products such as Oracle appeared in the 1970s-80s, we still had Brezhnev, the Moscow Olympics, perestroika, a lag in the electronics industry, and so on. We are still catching up, or developing the existing systems.

The first DBMSs in the world were commercial, distributed piecemeal, and required specialists on hand. Historically, buying software for hard currency has not always been customary in our country (let us leave aside the political and economic motives), so the alternatives were free DBMSs and pirated versions of commercial products. Nowadays universities often teach on free database engines, and open source software is stylish, fashionable and youthful, so the market is quickly filling with open source specialists. Russia has its own DBMS developments, such as ClickHouse, Tarantool and PostgreSQL forks, but no commercially exported databases. There is a register of Russian software where you can look up the current state of domestic DBMSs, though its composition raises doubts: alongside names you may have heard of, there are entries like "Passport office of university hostels."

The transition to open source in Russia has also accelerated because of sanctions. I remember how the news that Oracle might be banned from selling technology to the Russian oil and gas industry changed the vision of the future in the minds, and budgets, of some of our companies with good IT budgets. As said above, "sawing the monolith into microservices" often leads to open source. PostgreSQL, MySQL, SQLite and MongoDB projects are growing in Russia, as they are throughout the world.

It might seem to many that open source overtook commercial databases long ago, but that impression tends to form if you belong to the world of online projects, websites, smartphone applications, and so on. Hence the following comparison: online vs offline.

Online vs offline projects

There are two classes of projects that are similar in some ways but not in others: projects focused primarily on online sales, and offline businesses.

If you take a classic offline business in natural resource extraction, banking, wholesale distribution or retail, it most often grew out of a technology stack based on Windows solutions. Going online, it can use the WISA stack (Windows, IIS, MS SQL, ASP.NET). There are also online projects originally built on WISA, for example the well-known StackOverflow. Purely online projects, however, are currently dominated by stacks like LAMP (Linux, Apache, MySQL, PHP). Young teams joining an existing business often have never worked with the Windows stack and propose rewriting the existing systems. In Russia this trend has been felt very strongly in recent years.


Money likes to be counted, so to compare the shares of online and offline projects, let's look at the world ranking of companies by revenue. The top is occupied by retail, resource extraction, automotive and so on. The largest companies (not only in Russia) usually run a zoo of technologies: ERP and CRM systems such as SAP, Microsoft Dynamics NAV or AX, 1C in Russia, and much more. Companies with large turnover use various enterprise management systems, which in turn often run on commercial databases: Oracle, MS SQL, IBM DB2.

But the capitalization of companies worldwide has changed over the past 10 years: IT giants are now at the top, and only two companies from the old top remain, Microsoft and Alphabet (Google). The dynamics of this change can be seen here. This means cash flow is shifting online. We have all become accustomed to paying and transferring money online, buying goods with delivery, and so on. And the current year, 2020, has brought online companies a lot of profit.


Now the ranking of Russian companies by revenue. The first places belong to resource extraction, banks and retail, not IT corporations; Yandex is in 113th place (though in the top 20 by capitalization). There are trends toward introducing more open source solutions, for the reasons described above. Historically, many of the largest online retailers in Russia use paid databases, but they are considering a switch, testing parts of their functionality on open source. Banks have begun their migration online; the example of Tinkoff Bank confirms the need to develop online banking.

By the end of the year, because of the situation with the virus, the number and size of online projects had clearly grown. Cloud solutions are also growing; more on that below.

Relational DBMS versus the rest

The abundance of articles, including on Habr, about different types of databases can mislead you into thinking the whole world is abandoning classic relational DBMSs. In fact, relational DBMSs make up the vast majority of all DBMSs in the world. Other types account for roughly a quarter, and you need to understand that they most often serve specific functionality.


As an example: at the moment, in the daily work of a typical large company, different DBMSs can be in wide use simultaneously: a paid MS SQL (or Oracle) plus a set of MongoDB, Redis, MySQL, ClickHouse, ElasticSearch, and so on.

Let's very briefly consider the main types:

Relational: the main type associated with databases. Data is stored as two-dimensional tables with fixed columns and rows holding the values. Indexes are used to speed up searches on the field or fields specified when the index is created. The link between two tables goes through matching fields in them, the keys. To add, change and delete data, SQL (Structured Query Language) is used; more on it below. The description of the data structure is stored in the same database, in system relational tables. This simple two-dimensional-table idea has stood the test of time and remains the most prevalent.
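The ideas above, tables linked by keys, an index on a lookup field, and SQL to query them, can be sketched with the standard sqlite3 module. Table and column names are invented for illustration.

```python
import sqlite3

# Two tables linked through a key, plus an index to speed up lookups
# on the linking column.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 15.0), (12, 2, 42.0);
""")

# The join walks from orders to customers through the shared key fields.
rows = db.execute("""
    SELECT c.name, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', 114.5), ('Bob', 42.0)]
```

The same query would run, with at most cosmetic changes, on any relational DBMS, which is the point of the SQL standard discussed later.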

Document stores:

The difference from relational databases is that data is stored as documents with arbitrary structure; the columns of a table are not hard-coded. However, you can create indexes that record a link to a row when the specified document attribute is found. A typical representative is MongoDB, which stores documents using JSON (JavaScript Object Notation) syntax, or rather BSON (Binary JSON), which is more compact, as binary formats tend to be compared with strings.

This is how rows look in a collection (the analogue of a table).

Likewise, documents in collections can refer to each other.
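A toy sketch of the two properties just described, documents without fixed columns, plus a secondary index on an attribute, can be written in plain Python (a stand-in for MongoDB, not its API; the documents are invented):

```python
from collections import defaultdict

# Documents have no fixed schema: each can carry its own set of fields.
docs = {
    1: {"name": "Alice", "city": "Moscow", "tags": ["vip"]},
    2: {"name": "Bob", "city": "Moscow"},              # no "tags" field at all
    3: {"name": "Carol", "city": "Kazan", "age": 33},  # extra "age" field
}

# A secondary index on "city": attribute value -> ids of matching documents.
city_index = defaultdict(list)
for doc_id, doc in docs.items():
    if "city" in doc:
        city_index[doc["city"]].append(doc_id)

# A lookup consults the index instead of scanning every document.
moscow_ids = city_index["Moscow"]
print(moscow_ids)  # [1, 2]
```

In MongoDB the index creation and the lookup are done by the server, but the mechanics are the same: the index "inserts a link to a row" for each document where the attribute is found.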

Key-value: this type of NoSQL solution arose from the need to quickly write, change and read values by some key. It is often used for non-critical, rapidly changing values that make no sense to record and store durably. A typical representative is Redis. Even on an old laptop you can get tens of thousands of write and update operations per second. This is achieved by keeping the data in memory, where operations are faster; the corresponding disadvantage is that when memory runs short, speed degrades. For example, suppose you want to count requests from one IP per minute. A record is created whose key is the IP, and each request adds +1 to the counter. If there are too many requests, throttling (limiting) kicks in. The key can have a TTL and be reset every X minutes.
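An in-memory sketch of that IP rate limiter. With Redis the same idea is an INCR on a per-IP key plus EXPIRE for the TTL; here a plain dict stands in so the example is self-contained, and the window and limit values are arbitrary.

```python
import time

WINDOW = 60          # seconds per counting window (the key's "TTL")
LIMIT = 100          # requests allowed per window per IP
counters = {}        # ip -> (window_start, request_count)

def allow(ip, now=None):
    """Record one request from `ip`; return False if it should be throttled."""
    now = time.time() if now is None else now
    start, count = counters.get(ip, (now, 0))
    if now - start >= WINDOW:        # key "expired": reset the counter
        start, count = now, 0
    count += 1
    counters[ip] = (start, count)
    return count <= LIMIT

print(allow("10.0.0.1", now=0.0))    # True: first request in the window
```

With Redis the dict disappears and `allow` becomes roughly `INCR ip` followed by `EXPIRE ip 60` on the first hit, with the comparison against LIMIT done on the returned counter.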

Search engines: search is an essential function in any system. Relational databases cope very efficiently with searches for an exact match such as an ID (code, article, part number, etc.), but searching inside documents for a phrase, including different forms of a word, plurals and other features of a living language, is not nearly as fast: you have to scan the data from start to finish looking for suitable documents. So these systems act as the big search engines do when indexing pages. Put simply, they go through the documents in advance and build lists of the words found in each document; when a search is needed, they look through the prepared word lists with links to documents, and the more words match, the more likely that document is the one wanted. A typical representative is ElasticSearch; its large number of installations also owes much to the standard ELK stack (ElasticSearch + Logstash + Kibana) for monitoring events, for example logs of web servers or services.
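The "prepared lists of words with links to documents" is an inverted index, and a toy version fits in a few lines (the documents are invented; real engines add stemming, ranking and much more):

```python
import re
from collections import defaultdict

documents = {
    1: "PostgreSQL is a relational database",
    2: "ElasticSearch is a search engine built on Lucene",
    3: "A search engine indexes documents ahead of time",
}

# Build the word lists in advance: word -> set of document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in re.findall(r"\w+", text.lower()):
        index[word].add(doc_id)

def search(query):
    """Return ids of documents containing every query word."""
    word_sets = [index[w] for w in query.lower().split()]
    return sorted(set.intersection(*word_sets)) if word_sets else []

print(search("search engine"))  # [2, 3]
```

The query never rescans the texts; it only intersects the precomputed word lists, which is why such engines stay fast as the corpus grows.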

Wide column stores: best thought of as a middle ground between a relational database and a key-value store. There are tables, rows and columns, but the columns have no rigid structure and can have different names and values in different rows.

Representatives of this type are Cassandra and HBase.
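The "rows with their own columns" idea can be sketched as a dict of dicts, a loose simplification of how Cassandra/HBase rows behave, with invented row keys and columns:

```python
# Every row has a key, but each row carries its own set of columns.
table = {
    "user:1": {"name": "Alice", "email": "a@example.com"},
    "user:2": {"name": "Bob", "last_login": "2020-11-01"},   # different columns
    "user:3": {"name": "Carol", "phone": "+7-900-000-00-00"},
}

# Reads address a (row key, column name) pair; a column a given row
# never had simply comes back empty rather than being an error.
print(table["user:2"].get("email"))       # None
print(table["user:2"].get("last_login"))  # 2020-11-01
```

In a relational table every row would have to carry all of these columns (mostly as NULLs); here each row stores only what it actually has.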

Graph: a type of database designed to process data represented as graphs. As a graph you can represent settlements (vertices) and the roads between them (edges). A typical task is to find the shortest route between points by breadth-first traversal of the graph or with more advanced algorithms.

It is also convenient to represent as a graph the connections between people (vertices), who knows whom (edges), along with their ages and interests. The formula of a chemical compound can be pictured with the atoms of the molecule as the vertices of the graph and the chemical bonds between atoms as the edges. Graph theory is extensive and has been developing since the 18th century, so a large mathematical base has accumulated. A typical representative of graph DBMSs is Neo4j.
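The shortest-route task mentioned above can be sketched with a breadth-first search over an unweighted graph; the towns and roads here are invented:

```python
from collections import deque

# Vertices are towns, edges are roads, as an adjacency list.
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the first path reaching `goal` is shortest."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph[path[-1]]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal unreachable

print(shortest_path(roads, "A", "E"))  # ['A', 'B', 'D', 'E']
```

A graph DBMS such as Neo4j does the same traversal server-side, expressed declaratively (in Cypher) instead of hand-written loops, and adds weights, properties and indexes on top.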

Columnstore

Although the db-engines rating does not list this type separately but counts it as relational, it is worth mentioning. Commercial relational DBMSs include this capability as a separate feature, and there are also specialized standalone solutions. The main distinction of columnar databases is that data is stored not by rows but by columns. If a column contains many identical values, they compress very well and take up much less space on disk and in memory. Representatives of this type are ClickHouse and Vertica. The animated illustration of this is best seen on the ClickHouse website.
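Why repeated values in a column compress so well can be shown with the simplest columnar trick, run-length encoding (the column contents are invented; real engines combine several such codecs):

```python
from itertools import groupby

# A column of country codes: long runs of identical values are typical.
country_column = ["RU"] * 5 + ["DE"] * 3 + ["RU"] * 2

# Run-length encoding stores each run once, as (value, length).
encoded = [(value, len(list(run))) for value, run in groupby(country_column)]
print(encoded)  # [('RU', 5), ('DE', 3), ('RU', 2)]

# Decoding restores the original column exactly.
decoded = [value for value, count in encoded for _ in range(count)]
assert decoded == country_column
```

In a row store these values would be interleaved with every other column of each row, so such runs never form and this compression opportunity is lost.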

Recently ClickHouse, the Yandex DBMS, has begun to appear in such diagrams. The numbers differ from source to source, but the fact that it is being noticed and included is already good for its development.


Multi-model databases

Many DBMSs have added new storage types over time alongside their main, historical one. The world does not stand still, so when the creators of a DBMS saw the need to support other types of data storage, they added them. As a result, the description of most modern top-ranked DBMSs may say "multi-model". The move of the large relational DBMSs to multi-model has limited the growth of many NoSQL solutions in recent years: why use something separate if the type you need is included in your main DBMS as a secondary one?

SQL vs NoSQL

The term NoSQL itself arose a little over 10 years ago, around 2009, and, as we would now say, the hype took off. Many new software products designed to solve particular problems inherent in the relational-DB-plus-SQL combination proudly began calling themselves NoSQL to show that they were new and promoting hitherto unseen technologies capable of solving many problems. And there really were problems. There was a need to scale solutions out easily, because data volumes began to grow dramatically and data could arrive in huge quantities. Moreover, unstructured data began to be saved: what was clicked on a site, where the user went, what they searched for, which elements popped up, which banner was shown, and so on. All of this is dumped and stored. So it should not surprise you that you are persistently shown ads on a topic that once interested you and confidently led down a funnel toward a decision, a purchase, a subscription, and so on.

Here is a data growth graph; it is slightly dated, but it shows how data volumes were already growing more than 10 years ago.


From my own experience I will say that in the 2000s, offline companies generated more dollars of revenue per 1 GB of data in the database than an online company does now; online the figure is roughly 100-200 times fewer dollars per GB.

The second SQL problem arising from data growth was heavy transactional write operations, which could not reach 10 or 100 thousand per second on simple hardware. I will not expand now on the pros and cons of transactions, but it turned out that transactions can be sped up or simplified, and often they are not needed at all. They are not needed when there is so much data that, having lost some of it, you can easily reconstruct the whole picture from what remains. Not all data is so important that its partial loss is critical for the business.
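One way transactions are "simplified" in practice is batching: committing once per row forces durable work per row, while one transaction around the whole batch amortizes it. A sketch with sqlite3 (absolute timings depend entirely on the machine and storage, so only the row count is checked here):

```python
import sqlite3
import time

def insert_rows(per_row_commit, n=1000):
    """Insert n rows, committing either after every row or once at the end."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
    start = time.perf_counter()
    for i in range(n):
        db.execute("INSERT INTO t VALUES (?, ?)", (i, "x" * 32))
        if per_row_commit:
            db.commit()      # a transaction boundary per row
    db.commit()              # single boundary for the whole batch otherwise
    elapsed = time.perf_counter() - start
    count = db.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    db.close()
    return count, elapsed

count, _ = insert_rows(per_row_commit=False)
print(count)  # 1000 rows either way; only the per-commit overhead differs
```

On a disk-backed database the gap between the two modes is typically orders of magnitude, because each commit must be made durable before the next row proceeds.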

So simple but effective NoSQL systems appeared that solve their problems without the accumulated limitations of SQL, work quickly, and are often also free. But miracles do not happen, so NoSQL solutions often have limited functionality, or work quickly only until some resource runs out, for example RAM or the number of connections.

You can often find pictures like this classifying NoSQL databases by type, with example DBMSs. We examined these types above.

And what about SQL, the Structured Query Language? It has existed since the early 1970s and has been standardized, and thanks to these standards, supported by all the makers of relational databases, the differences in working with different relational databases are minimized. Yes, vendors add their own features that may go beyond the SQL standard when they offer new competitive technologies, but if those become widely supported, they and their descriptions make it into the SQL standard over time. If you can write SQL queries in one DBMS, it will not be difficult to switch to another and keep working. Moreover, NoSQL DBMS clients often offer SQL-style querying as a feature. For example, for MongoDB I often use Studio 3T, where you can write regular SQL that is translated into MongoDB's native queries, and MongoDB itself has an SQL adapter. ClickHouse and Tarantool (Russian developments) support SQL queries. Many NoSQL DBMSs also have features inherited from SQL, for example joins, data schemas, NULL logic for values, and so on.
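The "NULL logic" just mentioned is a good example of standardized behavior every SQL engine shares: NULL compared with anything, even another NULL, is neither true nor false. A quick demonstration with sqlite3 (the table is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (v INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])

# '= NULL' matches nothing under three-valued logic;
# the standard spelling for the check is 'IS NULL'.
wrong = db.execute("SELECT COUNT(*) FROM t WHERE v = NULL").fetchone()[0]
right = db.execute("SELECT COUNT(*) FROM t WHERE v IS NULL").fetchone()[0]
print(wrong, right)  # 0 1
```

Because this behavior is standardized, the same query gives the same counts on PostgreSQL, Oracle, MS SQL and the rest, which is exactly the portability argument above.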

Cloud DBs vs on-premises DBs

Cloud technologies keep growing, at tens and hundreds of percent per year across various cloud services, and cloud databases are among them. When any area of the economy grows at such rates, companies put ever more effort into occupying that market and getting their piece of the pie, and then into attracting more customers to make the pie grow further. Yes, you and I are part of their pie. There are many articles on Habr about clouds and their pros and cons; to avoid repetition, you can read, for example, here, or other articles under the corresponding tags.

According to Gartner's estimates, the volume of the entire cloud market will double in 5 years.

These are the distributions by company for BPaaS and IaaS. My impressions match them closely: AWS is the leader, Microsoft has been catching up a little in recent years, and Alibaba is growing and is the cheapest in the Chinese market, which global companies can no longer ignore.

The database-in-the-cloud (DBaaS) market looks much more modest compared with total cloud spending, even though every company has databases, and not small ones.

This is explained by several factors:

  1. Often, companies do not use the provider's own cloud databases, because applications would need to be adapted and there are features that prevent it. More often, cloud infrastructure is used: you get virtual hosts with CPU, RAM and SSD (HDD) and install instances of standard DBMSs on them.
  2. Companies are in no hurry to move data to the cloud even while moving other services there, because of data security. It is one thing when your data sits in a specific, known place in your network or physical hosting, even if distributed around the world, and another when you have pushed it into the black box of a cloud provider and do not understand how and where it is stored or how it is protected.
  3. Anyone who has dealt with clouds knows that expenses can arise where you do not expect them and therefore did not budget. An example: putting data into cold cloud storage is cheap, but downloading it back is tens or hundreds of times more expensive. So if you make backups and store them untouched, it is cheaper than your own disks; but when you want to download them back, you first wait many hours for retrieval and then pay a significant amount per MB. And this is the situation with AWS, Azure and the rest. Here is a recent story about NASA, which auditors warned would be paying far more than planned. Budget overruns from functional growth are quite common.
  4. Transfer speeds from the cloud are sometimes depressing. Say you backed up from one hosting site to the cloud and then decided to restore on another; it is a cloud, after all. But even if you have paid for the option of distributed data storage across many regions, you may be unpleasantly surprised by the restore speed in the neighboring US state. If the backup is hundreds of GB, it can be several times faster to back up to a local disk, copy the backup over a dedicated channel and restore it on the other hosting site.
  5. Small, hard-to-explain performance fluctuations may occur, caused by things like the neighboring virtual hosts that share the provider's physical host with you. If you have worked with virtual hosts on a local network, you have probably seen how loaded virtual servers on the same physical machine can affect each other. On your own network you can control this, or take measurements by moving virtual servers off a shared physical server.
  6. Once hooked on the clouds, you depend on the cloud provider, and that is another point of failure in your information infrastructure. For example, Alibaba switches resources off without hesitation: as soon as the clock strikes twelve, everything turns into a pumpkin because there were no prepaid yuan left on your cloud account.
  7. In Russia there is the added circumstance that not long ago a large-scale blocking of foreign resources was carried out, and a great deal was blocked, including the IP addresses of cloud databases. This is the reality, and it can happen in many countries.
  8. Because of the sanctions applied to Russia, domestic companies have questions about storing data in the clouds of the top American companies. There is no guarantee that tomorrow all your data will not disappear along with a disabled account. Disconnecting a cloud account is quick and easy.
In Russia there are Yandex.Cloud and SberCloud; I honestly have not used them for databases. I did have experience with other Yandex services, which were then moved to the cloud, had their protocols changed and were made paid. So far I am not interested in paying them, since there are other providers such as Microsoft and Google with free quotas for small volumes and a number of other advantages.

Summing up: the move to the clouds is going somewhat faster worldwide than in our country. I know entire companies in the West that have moved almost everything to the clouds as a matter of strategy, but the share of such companies here is still far smaller. To illustrate, here is a graph of the compound annual growth rate.


If you need to deploy functionality quickly anywhere in the world, there is no alternative to the clouds, but be prepared to later pay bills commensurate with resources you did not even know you were consuming. You need to calculate the spending in advance, or find out in practice. And security in the clouds is not yet convincing.

In Russia, cloud storage prices are lower than the world average, but foreign companies do not rush to store data here, so the domestic clouds grow mostly at the expense of government customers and a small share of companies.

OLTP vs OLAP

Data in a database can be used to conduct current business operations, Online Transaction Processing (OLTP): find a client, issue an invoice, pay for a resource, write off stock from the warehouse when goods are ordered, and so on. Almost all such operations must run in real time. If a user of a website or an internal corporate system waits several seconds for simple operations and the database is the cause, there is something to optimize. OLTP was originally designed for real-time business: if a company has databases, it has OLTP.

There is also data used for analysis, Online Analytical Processing (OLAP). Large volumes of data are collected for OLAP, and to slice them quickly along any dimension you essentially have to predict in advance everything the business is likely to need. That is, if you want to know the number of clicks on your global website by country or by page, you need to compute those counts ahead of time, grouping by time as well, so that you can later watch the dynamics and compare against historical trends. OLAP storage can be non-relational and generally unstructured, and can use specialized languages for handling large data volumes or languages for statistical processing. Recently it has become fashionable to call ordinary business analysts Data Scientists. This is not entirely accurate, but the term has caught on. Usually the job is a mix of the following ingredients: SQL, Python, R, neural network frameworks, mathematical models of various kinds, and so on.
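The clicks-by-country-and-page example above amounts to pre-aggregation: raw events are rolled up once, and dashboards then read the small aggregate instead of rescanning every event. A minimal sketch with invented events:

```python
from collections import Counter

# Raw click events, as they might arrive from a website.
events = [
    {"country": "RU", "page": "/home"},
    {"country": "RU", "page": "/home"},
    {"country": "DE", "page": "/home"},
    {"country": "RU", "page": "/pricing"},
]

# Roll them up once by (country, page); reports then hit this tiny table.
clicks = Counter((e["country"], e["page"]) for e in events)
print(clicks[("RU", "/home")])  # 2
```

In a real OLAP pipeline the same rollup would also be keyed by hour or day, so the time dynamics mentioned above come from reading precomputed buckets rather than raw events.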

OLAP databases are usually fewer in number than OLTP ones, but larger in size. Multithreading support matters for OLAP databases: a query is parallelized across cores, each doing its part. If your OLAP DBMS can also shard across many servers and supports the latest SIMD (single instruction, multiple data) processor instructions, which process large batches of data in one operation, then processing speed scales with each of these factors.
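The "each core does its part" pattern, in miniature: split the data into chunks, let each worker aggregate its chunk, then combine the partial results. This sketch uses threads for simplicity; a real engine uses OS threads across cores and SIMD within each core.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]  # four roughly equal slices

# Each worker computes a partial aggregate over its own slice.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Combining the partials gives exactly the full aggregate.
total = sum(partial_sums)
print(total == sum(data))  # True
```

Sum is trivially combinable; the same split-aggregate-merge shape works for counts, min/max and, with a little more bookkeeping, averages and group-bys.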

The general trend is that analytics is separated from the data needed for operations. Typically, analytical data is moved to separate servers, where you can read anything without affecting the company's operational work. For computing neural networks and other heavy mathematical models, clouds are used, with computing services on specialized boards such as NVIDIA Tesla.

SSD vs HDD vs Storage vs Tape vs Other

This part is about which storage media to keep database data on.

In 2020, there is no doubt that SSDs are winning the battle against HDDs, and in database servers this understanding arrived much earlier than elsewhere. The point is that for most database types, what matters is not sequential reading but reading into memory from scattered places on the disk, along with equally random writes. SSDs have no problem with this, while on an HDD the speed of access to a random disk location is limited by the spindle rotation speed and the speed at which the read head moves between tracks. Try copying several dozen files to an HDD simultaneously from different locations: the speed quickly degrades to unacceptable values. Likewise, requests for data from 1000 users scattered across the disk will quickly negate the speed of any HDD, so using HDDs for OLTP operational systems makes little sense. The picture below shows ordinary SSDs with 6,000 IOPS (read and write operations per second); server drives, especially NVMe ones, deliver far more, though it is worth separating marketing figures from short benchmarks that hit the cache from the real behavior of a disk working in this mode around the clock.

It makes sense to use HDDs in OLAP systems where data is sequential and is read and written only that way, or for data backups, which are large sequential writes and reads, and also in large archival databases and wherever the cost of storing 1 GB is the deciding factor: HDDs are cheaper than SSDs per GB.

In terms of fault tolerance, SSDs are better than HDDs when viewed as separate devices. This is personal experience across thousands of drives: SSD failures are much rarer than HDD failures. Keep in mind, though, that these are statistics for server models, many of which were built on SLC and MLC flash; those are more expensive but tolerate far more rewrite cycles than the currently promoted TLC and QLC, which are not recommended for databases. Server systems that store databases use disks and components with increased fault tolerance: a 1 TB SSD costing $1000 is a normal situation for a database. Such drives can run for months at the limit, reading and writing heavily, without overheating or dramatically slowing down. I could not find a picture comparing the fault tolerance of server SSDs and HDDs, but there is one for consumer drives: SSDs fail less often.

24.png
Common SSD form factors are 2.5-inch hot-swappable devices, PCIe add-in cards, and U.2, the server counterpart of the M.2 found in desktops. The modern SSD protocol is NVMe.

Storage – data storage systems (DSS) are external data stores connected to servers via fiber optic or network interfaces. They are placed in the same server racks as the servers they serve. Storage is another huge layer of information, enough for ten articles: specialized equipment whose main purpose is high fault tolerance and increased data-processing speed. Prices for advanced storage systems start at tens of thousands of dollars, and that is with a minimal set of disks; the upper bar is unlimited and can reach millions of dollars or more. Modern storage systems may carry words like AllFlash in the name, which implies rejecting HDDs entirely, with internal algorithms and code optimized only for SSDs.

After acquiring EMC, Dell consolidated its position in the enterprise storage market. Huawei is growing before our eyes and becoming a prominent player despite US sanctions. Russia has no world-class storage systems of its own; all significant local market players simply relabel finished products under their own brand or assemble their own versions from parts made by well-known manufacturers or vendors.

25.png
Intel Optane (3D XPoint) is a specific type of non-volatile memory, currently the fastest for random reads, but without such a clear advantage for random writes, and it loses to top SSDs in sequential reads and writes. It did not take off because of the high price of the drives and the lack of large capacities, so SSD + NVMe remains the best price/performance ratio: for the price of an Optane drive you can buy several SSDs, which in RAID will give more speed.

RAID. There is no point in repeating what it is: combining disks into arrays for speed and fault tolerance. Which RAID level to use depends on the task being solved; for OLTP databases, RAID 10 is the most common.

Tape: tape data storage. Many will be surprised, but in 2020 tape is still alive: new versions of tape cartridges come out, and huge tape libraries holding hundreds of cartridges are released. The reason is that the cost of tape storage remains the lowest. Stored tape requires no electricity and keeps data longer than disks, but access is very slow and strictly sequential. There is a suspicion that the coldest archival tiers of cloud storage may be using tape.

26.png

DBMS capabilities

Any feature of a DBMS is a competitive advantage that plays a role when choosing whether to deploy it in your infrastructure. I will not dwell on self-evident requirements such as error-free operation, support for different character sets, collations for any country, and so on; cost was also discussed above. The conversation will be about important and popular technologies and components of a DBMS. In reality, there are more of them.

Horizontal scaling out of the box: Horizontal scaling vs Vertical Scaling.

This is what largely drove the emergence of many database types and DBMSs over the last 10–15 years. In the early 2000s, Oracle, Microsoft, and IBM campaigned for consolidating disparate data from a company's many branches into a single center: one powerful server with the data, which everyone works with remotely, including the emerging corporate sites, Web APIs, and mobile clients. But by the end of the 2000s, with the explosive growth of data, it became clear that vertical scaling (buying an ever larger server) was either too expensive or simply no longer possible: the number of CPUs, disks, network links, and connections a central infrastructure node can hold runs into hard limits. So solutions appeared that distribute data across multiple database servers.

27.png

A large role here was played by social networks and search sites, which rapidly gained users whose content grew exponentially, especially once multimedia and video files were added. Geographical distribution of servers also matters: client data is stored on servers that are geographically closer to the client, so the response time is shorter. Initially, distributing data across many small servers was done at the application level; then such capabilities were built into DBMS engines that companies adapted for themselves, and eventually it became a trend.

Don't think horizontal scaling will solve all your problems. On the contrary, it adds complexity to everything and everyone: code, infrastructure, bug hunting, and much more. And all of this costs money. But you no longer need very powerful and expensive servers, and your project can survive heavy loads. The individual nodes storing data are often called shards, and the process of distributing data and requests between nodes is sharding. If you choose the sharding key correctly, a request for data goes only to the shard where that data lives. Either the application itself or the DBMS engine can take over request distribution. The picture below shows an example where a hash function determines which server to use for sharding; each shard can also have copies (replicas).

28.png
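A minimal sketch of the hash-based routing shown in the picture, assuming hypothetical node names; production engines add virtual nodes, rebalancing, and replica awareness on top of this idea:

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(key: str) -> str:
    """Pick a shard by hashing the sharding key: a stable hash
    modulo the shard count, so the same key always lands on the same node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Deterministic routing: all reads for one user hit exactly one shard.
assert shard_for("user:42") == shard_for("user:42")

# With enough keys, a uniform hash spreads load across all shards.
routed = {shard_for(f"user:{i}") for i in range(1000)}
print(sorted(routed))
```

Note the design trade-off: simple modulo hashing reshuffles most keys when a shard is added, which is why real systems often prefer consistent hashing.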

At the moment, it is not so easy to buy a new 8-processor server for peak performance: their supply is very limited because the market no longer needs them. They are being superseded by 4-processor servers, which are cheaper, and not by a factor of two but more. To take a real example, a modern 2-processor server is comparable or superior in computing power to an 8-processor server from 10 years ago. Besides processors, all server components have gotten faster: memory, bus, and so on. The same query will run 2–3 times faster if all the data is in memory, and DBMSs are very good at using cores and executing queries in parallel.

With cloud providers, doubling the number of cores in a virtual server more than doubles the price, and there is a cap on the core count: virtual machines that need most of a physical server are unprofitable for the provider, since the capacity sits idle whenever you are not loading it.

Bottom line: really large projects will need horizontal scaling. If you have not grown to that point, it is better not to complicate your services and database by spreading them across many nodes, unless the DBMS supports this completely transparently for applications. It is easier to perform operations in the RAM of one server than to drag scattered pieces of data from different nodes and combine them to get the result.

Fault tolerance out of the box - High Availability. Master-master, master-slave.

High Availability is the term for when your project must always work: every minute of downtime threatens direct or indirect losses. In fault-tolerant systems, this is achieved by duplicating nodes. The space industry does this because equipment failure in orbit would be costly, and the same is done on earth with any important production system. The important nodes of your infrastructure are duplicated: servers, data storage, network switches, Internet channels, and so on.

For the database, the servers running the DBMS software are duplicated, and the data itself is duplicated — in the simplest case, by RAID arrays of mirrored disks. In specialized storage systems, disks, power supplies, processors, and memory are all duplicated, and a battery makes it possible to flush data from the cache to disk on power failure.

Also, for fault tolerance, online copies of the data are created to keep the information up to date. Data is synchronized to a copy either synchronously, when the change operation must be confirmed by all nodes, which introduces delays in saving, or asynchronously, when data is saved locally and then propagated to the other location separately.

It looks something like this: there is one database with the data, and changes flow out to other servers, possibly in a remote data center. Slave copies can be used for reading, so requests for data can be directed to them. This is called Master-Slave.

29.png               
But sooner or later your database will no longer keep up with writes because of the total load, and you will want data to be written to another, less-loaded copy rather than having it be read-only. This scheme is more complicated, since you now have to resolve conflicts when the same data is changed on different nodes at the same time, and if your master copies are far apart, the probability of conflicts increases. This is called Master-Master. In addition, each master can have slave copies.

30.png

Duplication of data to other servers can cover either the entire database or individual tables.

It is good when your DBMS supports such data distribution itself, so you do not have to carry the burden of problems like confirming a data change in one place before the data appears in another. What if the network blinked? What if a copy does not respond? Many such "what ifs" are handled by the engine. If the master becomes unavailable, roles are redistributed automatically: yesterday's slave becomes the master, and all other copies start receiving data from it. The application does not notice the switch, because it works through a router that switches too. Your database server has broken down, yet there is no downtime — everything keeps working as if nothing had happened. You can also take a database instance out of rotation, install a DBMS update, and then put it back into operation.

Advanced storage systems can copy data to another storage array in the background at the disk level: you simply have a copy of the disk elsewhere. In some cases such copies can be used, but usually they are not available even for reading until synchronization stops.

Online maintenance - online alter

24/7/365 means your project runs 24 hours a day, 7 days a week, all 365 days of the year: you have no maintenance window. So all operations — creating backups, rebuilding indexes, creating tables, deleting columns, and much more — must happen online, without noticeable performance degradation. For example, while you are rebuilding a table in a relational database by deleting a data column, the table remains available, and a copy is created that accumulates all the changes made while the rebuild is in progress. It is not always possible to have many copies of database servers — for paid DBMSs, taking them out of rotation in turn also costs money — so the ability to change structures without interrupting work is very important.

Monitoring

Every heavily used and modified database will sooner or later face performance issues. But how do you spot problems before they start costing the company money? You need to monitor the state of your DBMS. If you can measure the current state — or better, look at the statistics the DBMS itself collects — you can solve many problems out of the box. Paid DBMSs have an incredibly large number of metrics and statistics, and there are third-party companies that build monitoring tools; the market is very competitive and each tool has its own strengths. Free DBMSs have greatly improved their monitoring in recent years, and monitors are being built for them too — only these are often no longer free.

DBMS management tools

No matter how great your DBMS is, managing it should be convenient. Of course, everything can be done from the command-line tools supplied with the DBMS, but many things are more convenient and intuitive from a graphical interface; besides, you do not need to remember the full command syntax by heart, and you make fewer mistakes. A handy tool has third-party plugins — advanced autocompletion, for example, helps a lot and speeds up writing code, and you can click through tooltips to see the source of tables and procedures. Another very important function is executing queries across multiple databases at once: having configured connections to hundreds of servers and databases, you can propagate a change to all of them very quickly from a single window. More interesting still, there are tools that help you view server-side query execution plans, with hints about which indexes to build to dramatically speed up the query.

For some DBMSs there is another great capability, in demand among those working with big data: the percentage of query execution completed is shown, so you can see when the query will finish and return its result.

31.png

Another interesting area is data scripting: generating the SQL statements that will recreate data on another server, plus data migration, structure comparison, data comparison, export to other formats, integration with version control systems, production environment updates, and so on.

There are tools for managing many different types of databases — for example, DataGrip from JetBrains (the same company behind Kotlin, ReSharper, GoLand, etc.), which is very powerful and customizable. The picture shows the DBMSs it works with.

32.png

Extending the functionality of the DBMS in another programming language

No matter how powerful a DBMS is, sooner or later its built-in functions may not be enough — say, some new kind of hashing or data compression that needs to be supported at the database level. In programming languages you simply download the necessary library or source code, insert a call to it, and avoid reinventing the wheel and wasting company resources; the same approach can work in a DBMS. You can extend the DBMS's out-of-the-box functionality with your own plugins, and they will work just like built-in commands. Moreover, such extensions can often run faster than the DBMS's own embedded code. Of course, a number of restrictions are imposed on such extensions, but the very ability to write them in common programming languages is an important plus.

Logging changes

An important question is what happened to the data or database structures at some moment in the past. Change logging is useful when you alter structures or data but later need to get back the data itself, the table structures, the indexes, or the SQL code of queries: you will know that at such-and-such a moment things looked like this. It also protects against data destruction, which is most often unintentional. Each DBMS names the technology differently: Flashback Data Archive, Temporal History, Change Tracking, Data Audit, and so on.

If your DBMS can simply switch such logging on and off, that is something that will definitely come in handy. But miracles do not happen: once you start logging, expect extra load. Disks get busier, the logs grow in size (they may live in the database or separately), and if the database has copies, the logged changes must also propagate to those copies.
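As a rough sketch of what such change logging does under the hood, here is a hand-rolled audit trail built with a trigger in SQLite (table and column names are illustrative; real DBMSs ship this as Flashback, Temporal Tables, Change Tracking, and so on):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
-- Every update is remembered: old value, new value, and when it happened.
CREATE TABLE accounts_log (
    id INTEGER, old_balance INTEGER, new_balance INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER accounts_audit AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_log (id, old_balance, new_balance)
    VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")
con.execute("INSERT INTO accounts VALUES (1, 100)")
con.execute("UPDATE accounts SET balance = 250 WHERE id = 1")
row = con.execute("SELECT old_balance, new_balance FROM accounts_log").fetchone()
print(row)  # (100, 250): the log knows what the value was before the change
```

The extra insert per update is exactly the added write load the paragraph above warns about.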

33.png

Business logic in the database or not

Just as a database stores its own structure inside itself, query code can be stored in the database too. Query plans are saved on first execution so that later calls can find data using those plans. Making changes to code in the database language is easier: it compiles on the fly, so a change can be live within tens of seconds of deciding to make it, whereas publishing a service update means compiling it and ensuring every instance of the service updates correctly. This helps enormously in critical situations when seconds count and a single change in the database can stop a flood of load on the rest of the infrastructure.

Business logic is the set of rules that shapes a company's entire information structure, and there are several approaches to where it lives. It can sit at the application level (services, microservices), or in the database, which then not only stores the data but also intelligently decides how to return it, how to apply filters, how to combine data from different sources, and so on. The latter gives flexibility and speed of change, but the load on the database grows and performance becomes the limit. For example, suppose you decide to prohibit editing data from past years: this rule is easily implemented in the database, instead of tracking down every possible way of doing it from services, programs, sites, and so on. The more ways you have to manage business logic in the database, the better — you will have different ways to solve a business problem.

If the database becomes overloaded, the business logic is moved out to the service level, results are cached in every possible way, and so on, while the database is used only for fast data retrieval.

34.png

Not all databases allow storing query code inside them; ClickHouse, for example, does not even have a full-fledged scripting language. Each company decides for itself how to organize the storage of business logic — there are no universal rules.
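The "prohibit editing past years" rule mentioned above can be sketched as a database-side trigger. Here it is in SQLite, with an invented `sales` table and 2020 as the assumed current year — one rule in the database covers every service, program, and site at once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, amount INTEGER);
-- The business rule lives in the database: rows from past years are read-only.
CREATE TRIGGER no_past_edits BEFORE UPDATE ON sales
WHEN OLD.year < 2020
BEGIN
    SELECT RAISE(ABORT, 'editing data of past years is forbidden');
END;
""")
con.execute("INSERT INTO sales VALUES (1, 2019, 500)")
con.execute("INSERT INTO sales VALUES (2, 2020, 700)")
con.execute("UPDATE sales SET amount = 800 WHERE id = 2")  # current year: allowed
try:
    con.execute("UPDATE sales SET amount = 0 WHERE id = 1")  # past year: blocked
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The trade-off is the one the section describes: the rule is enforced in one place, but every write now pays for the check inside the database.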

JSON support

The most downloaded NuGet package for C# in Microsoft Visual Studio is a library for working with JSON (JavaScript Object Notation). This example shows that if something is in demand, it will make its way in everywhere — even at Microsoft, which historically championed XML. Although storing JSON goes against the rules of relational databases, the reality is that there is simply too much JSON data in IT infrastructure, and support for the format is being added to every type of DBMS.
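A small sketch of what DBMS-side JSON support looks like, using SQLite's JSON functions (assuming a build with the JSON1 extension, which modern Python bundles include; the `events` table is invented for illustration):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute(
    "INSERT INTO events (payload) VALUES (?)",
    (json.dumps({"user": "alice", "action": "login", "ip": "10.0.0.5"}),),
)
# json_extract pulls a field out of the stored JSON on the server side;
# other DBMSs have analogues (PostgreSQL ->>, MySQL JSON_EXTRACT, etc.).
user = con.execute(
    "SELECT json_extract(payload, '$.user') FROM events"
).fetchone()[0]
print(user)  # alice
```

The point is that the filter and extraction happen inside the database, instead of shipping whole JSON blobs to the application to parse.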

35.png

In Memory

RAM is faster than any disk, so to speed up operations as much as possible, including database operations, memory must be used to the fullest. I will say right away that every DBMS tries to use memory to the full, and a database of any type works faster when the server has plenty of memory. Writing data to disk is often deferred rather than performed directly during the write operation. There are many techniques for reducing the number of disk accesses; they differ between DBMSs or go by different names.

Some DBMSs support In-Memory operation as an option, while others declare it their main feature — Tarantool, for example.

Data compression

An important feature that saves on disk and server-memory capacity. In addition, compression speeds up reading data, saving copies of data, and so on; for OLAP storage it means faster query results over huge volumes of data. Understand that compression is not free in terms of resources, but the pluses outweigh the minuses: fast compression algorithms are used that do not heavily load the CPU. Compression is usually configured at the DBMS level and does not affect the software at all — nothing needs to change in the code.
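A quick sketch of why compression pays off for database-like data, using Python's zlib at a fast compression level (the sample "column" of repetitive log-style rows is invented):

```python
import zlib

# Repetitive, log-like rows: typical of the data that compresses well in a DBMS.
rows = ("2020-12-01;status=OK;region=EU\n" * 10_000).encode()

packed = zlib.compress(rows, level=1)  # fast level: light CPU load, good ratio
ratio = len(rows) / len(packed)
print(f"{len(rows)} -> {len(packed)} bytes, ~{ratio:.0f}x smaller")

# Lossless: decompressed reads see exactly the original bytes.
assert zlib.decompress(packed) == rows
```

Fewer bytes on disk also means fewer bytes to read per query, which is where the speedup for OLAP scans comes from.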

Temporary objects

Complex data-retrieval logic may go through several stages in which intermediate sets are formed and then feed into subsequent selections. It is not always necessary to persist these temporary sets as full-fledged data: they can live in memory for as long as the client connection exists, and you need not think about freeing them after disconnection. These can be variables, table variables, or temporary tables visible to other connections. The DBMS tries to keep temporary objects in memory, because their purpose implies they will be used and destroyed quickly, and there is no point committing their changes to disk.
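A minimal illustration with SQLite: a `TEMP` table exists only for the lifetime of the connection and needs no explicit cleanup (the staging table and values are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# TEMP tables are visible only to this connection and vanish with it.
con.execute("CREATE TEMP TABLE stage (v INTEGER)")
con.executemany("INSERT INTO stage VALUES (?)", [(i,) for i in range(5)])

# The intermediate set feeds a subsequent selection...
total = con.execute("SELECT SUM(v) FROM stage").fetchone()[0]
print(total)  # 10

con.close()  # ...and nothing needs to be freed: the temp table is gone
```

Exactly as the paragraph describes, the engine is free to keep such short-lived objects in memory rather than committing them to disk.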

MapReduce

Under this well-established term from Google we group a class of distributed computing problems. The name comes from the two steps: Map, distributing the input data among distributed nodes, and Reduce, collecting results from the nodes and forming the final result. Its best-known representatives, Apache Hadoop and Spark, are whole ecosystems of libraries, HDFS, and much more. An example of a DBMS built for such frameworks is Hive, a relational DBMS with SQL support. Trends:

36.png

It should be said that these technologies are not used in operational databases; they are used in specialized data warehouses, where it makes sense to spread data out and run distributed computations. For most companies without petabytes of data, ordinary DBMSs, with their distribution of computation between hosts, parallelism, and so on, are enough. Judging by the graphs and the decline of Hive's rating, interest in these technologies has at least not been growing in recent years.
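The two steps can be sketched with the classic word-count example, simulated on one machine (in Hadoop or Spark, each "split" would live on a separate node holding its piece of the data):

```python
from collections import Counter
from functools import reduce

# Input data, pre-split across imaginary nodes.
splits = [
    "open source databases grow",
    "commercial databases still lead in revenue",
    "open source wins mindshare",
]

def map_step(text: str) -> Counter:
    """Map: each node counts words in its own split, independently."""
    return Counter(text.split())

def reduce_step(a: Counter, b: Counter) -> Counter:
    """Reduce: merge the partial counts into the final result."""
    return a + b

result = reduce(reduce_step, map(map_step, splits))
print(result["open"], result["databases"])  # 2 2
```

Because the map step needs no coordination between nodes and the reduce step only merges, the same program scales from one machine to thousands.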

Working with spatial data

Suppose you need to find objects in the real world — most often, the nearest objects to a certain point in space. But how do you search for such data in a relational database? In principle, nothing prevents you from inventing your own way to find the nearest points in any kind of database and preparing the data so the search is fast. Database developers, seeing the demand for such searches, have added spatial index technologies in the form of grids, and you can often find index implementations based on the R-tree.

37.png
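The underlying query can be sketched as a brute-force nearest-neighbor search with the haversine great-circle distance (city coordinates are approximate and purely illustrative); a spatial index such as an R-tree exists precisely to avoid this full scan over every row:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# A toy "table" of points of interest.
places = {
    "Moscow": (55.75, 37.62),
    "Saint Petersburg": (59.94, 30.31),
    "Novosibirsk": (55.03, 82.92),
}

query = (56.0, 38.0)  # the point we search around
nearest = min(places, key=lambda name: haversine_km(*query, *places[name]))
print(nearest)  # Moscow
```

A grid or R-tree index replaces the `min` over all rows with a lookup that touches only candidates near the query point.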

Graph data

The ability to work with hierarchical or graph data. A description and examples of what graph representations are commonly used for were given above.
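A typical graph query — the length of the shortest chain of connections — can be sketched as a breadth-first search over adjacency lists (the tiny social graph and names are invented; a graph DBMS answers such queries natively, e.g. via Cypher's `shortestPath`):

```python
from collections import deque

# A tiny social graph as adjacency lists.
friends = {
    "ann": ["bob", "eve"],
    "bob": ["ann", "kim"],
    "eve": ["ann", "kim"],
    "kim": ["bob", "eve", "lev"],
    "lev": ["kim"],
}

def degrees(start: str, goal: str) -> int:
    """Breadth-first search: number of hops in the shortest friendship chain."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in friends[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1  # no path

print(degrees("ann", "lev"))  # 3
```

Expressing the same traversal in plain SQL takes recursive joins, which is exactly the pain point graph databases were built to remove.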

Safety

Typically, databases sit inside a protected network perimeter, and no one opens their ports directly to the outside. If you nevertheless somehow expose access to the server, have no doubt that attackers will first probe the ports, check whether a database of a known type is there, and then try to break in. Information has value, so opening database servers to the outside is like putting a safe full of money out the window — a strange choice.

The databases themselves contain a number of technologies that reduce the risk of unauthorized access to data. There is encryption of backups: even after stealing a disk with database copies, nothing can be done with it without the password. You most likely cannot simply read the files of a running database either, since the DBMS holds exclusive access to them. Every DBMS has granular user rights: groups that users belong to, and rights to various operations on the DBMS server, including reading data, writing, changing structures, settings, and so on — a huge number of access types for any real-life task. Access can be granted, revoked, or explicitly denied for particular resources. Security is a huge and very interesting topic, but not for this article. Do not forget to install updates, since manufacturers keep finding and closing vulnerabilities, and have your IT infrastructure's security audited by trusted companies. You will be unpleasantly surprised when the auditors (pentesters) get to your data — though most often it will not be because of security flaws in the DBMS itself or in your database.

GPU usage, NPU (Neural Processing Unit), Google TPU (Tensor Processing Unit)

At the present stage of database development, graphics and other specialized processors see no mass use in DBMS engines. Yes, GPUs and NPUs are used for mathematical calculations and training neural networks, but GPU and NPU RAM is smaller than that of conventional servers, and fetching or updating data — the most frequent database operations — does not require huge computing power. Data from the database can be fed into specialized neural-network frameworks for further iterations. The DPU (Data Processing Unit) is a class of non-standard processors usually integrated into network cards; their future is still in question.

Community

A large community of like-minded people using the same software will give you answers to questions that may be far from trivial. The first thing to check when faced with an incomprehensible error is similar questions with the same error text or situation description. There are many sites with reputable authors for different DBMSs. Nevertheless, here are statistics from StackOverflow.com on how many questions exist for the top databases — each may have several answers. If the community is large, your question almost certainly already has a solution. Such is the accumulated knowledge base.

38.png

For the big picture, here is an image of the connections between databases, frameworks, programming languages, and platforms. We will not look at usage percentages — that is always grounds for a holy war. The connections are more interesting: what is most often used together with what. DBMSs are marked in red. Where Oracle and IBM DB2 have gone is a mystery on the conscience of the chart's authors.

39.png

To summarize: every DBMS in the top has won its place through natural selection and holds its own piece of the market. Open source is overtaking commercial DBMSs, and in Russia this process has accelerated. The number of online DBMS-backed projects keeps growing, taking revenue share away from offline business. Relational databases will not give up their positions in the near future and will continue to prevail, and SQL will remain the dominant language for working with them. Flash memory is used to store operational databases' data. The more capabilities and features a DBMS has, and the more people use it, the easier it is to integrate into the infrastructure and maintain. The transition to database clouds is proceeding in fragments; in the coming years, most data will still be stored locally. Russian database technologies are not particularly noticeable on a global scale, but they exist, and we wish them success.

Source.