EXECUTION TIME - A framework for implementing a scalable business intelligence system

The ETL system's source code is version controlled and stored on GitHub²³. The tests were therefore also automated by constructing GitHub build pipelines²⁴ that execute the tests whenever a new update is pushed to the repository.

Performance testing was the next set of tests. The main performance test measured how much data could be loaded into the PowerBI reports and how long each load would take. In development, the Tag and Value queries read data from CSV files; in production, they read data directly from MongoDB.

The calendar query, on the other hand, builds and calculates the calendar. Thus, there was more room for improvement on this query. To evaluate the speed of the calendar query, a one-year calendar divided into one-minute intervals was produced. This would result in 525 600 separate calendar entries.
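The entry count follows directly from the interval arithmetic: 365 days × 24 hours × 60 minutes = 525 600. A minimal sketch of this check is given below, assuming a non-leap year (2021 is an arbitrary choice, not taken from the study) and a half-open range that excludes the first minute of the following year.

```python
from datetime import datetime, timedelta


def count_intervals(start: datetime, end: datetime, step: timedelta) -> int:
    """Number of whole `step`-sized intervals in the half-open range [start, end)."""
    return (end - start) // step


# Assumption: 2021 stands in for any non-leap calendar year.
year_start = datetime(2021, 1, 1)
year_end = datetime(2022, 1, 1)

entries = count_intervals(year_start, year_end, timedelta(minutes=1))
print(entries)  # 525600
```

A leap year would add one further day, that is, 1 440 additional one-minute entries.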

The queries were run on a Lenovo Legion laptop running Windows 10 with 16GB of memory and a 2.4GHz i5 CPU. The execution times of ten successive runs in Visual Studio Code were recorded for each method.
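The text does not state how the run times were captured. One possible way to record ten successive runs and average them is sketched below; `time_runs` and `placeholder_query` are hypothetical names, and the placeholder merely stands in for the calendar query.

```python
import statistics
import time
from typing import Callable, List


def time_runs(query: Callable[[], object], runs: int = 10) -> List[float]:
    """Run `query` back to back and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        started = time.perf_counter()
        query()
        durations.append(time.perf_counter() - started)
    return durations


def placeholder_query() -> int:
    # Hypothetical stand-in for the calendar query under test.
    return sum(range(100_000))


durations = time_runs(placeholder_query, runs=10)
average_seconds = statistics.mean(durations)
print(f"average over {len(durations)} runs: {average_seconds:.4f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.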

Figure 3-14 depicts the average execution time.

Figure 3-14 Average execution time for the calendar query (bar chart, values in minutes: Apply 11.23; Vectorisation 3.07; Dask 0.185)

23 What Exactly Is GitHub Anyway? Source: https://techcrunch.com/2012/07/14/what-exactly-is-github-anyway/

24 Creating a CI/CD pipeline using GitHub Actions. Source: https://medium.com/@michaelekpang/creating-a-ci-cd-pipeline-using-github-actions-b65bb248edfe

The first solution used the apply method²⁵. This method ran for longer than expected and consumed a large amount of memory over time. Vectorisation²⁶ was then used to shorten the execution time, reducing it to just over three minutes.
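The difference between the two approaches can be illustrated with a small pandas sketch. This is not the calendar query from the study: the derived column (a financial year), the start month in `FY_START_MONTH`, and the column names are all assumptions chosen for illustration. The row-wise version calls a Python function once per row, whereas the vectorised version operates on whole columns at once.

```python
import numpy as np
import pandas as pd

# Assumption: the financial year starts in March and is labelled by the
# calendar year in which it ends. Both choices are hypothetical.
FY_START_MONTH = 3

# Two days of one-minute entries that straddle the assumed year boundary.
calendar = pd.DataFrame(
    {"timestamp": pd.date_range("2021-02-28", periods=2 * 1440, freq="min")}
)


def financial_year_rowwise(row: pd.Series) -> int:
    """Row-by-row calculation, invoked once per calendar entry by apply."""
    ts = row["timestamp"]
    return ts.year + 1 if ts.month >= FY_START_MONTH else ts.year


calendar["fy_apply"] = calendar.apply(financial_year_rowwise, axis=1)

# Vectorised calculation: one expression evaluated over the whole column.
ts = calendar["timestamp"]
calendar["fy_vectorised"] = np.where(
    ts.dt.month >= FY_START_MONTH, ts.dt.year + 1, ts.dt.year
)

same_result = bool((calendar["fy_apply"] == calendar["fy_vectorised"]).all())
print(same_result)
```

Both columns hold the same values; only the way the calculation is dispatched differs, which is where the saving in execution time comes from.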

However, the query still consumed a significant amount of memory and would occasionally crash with a memory error when insufficient memory was available. The Python Dask library was therefore considered to optimise the query further. Dask is a Python library that is commonly used for parallel computing²⁷. With Dask, the query time was reduced to slightly over ten seconds.
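Dask reduces peak memory use by splitting a dataset into partitions and evaluating them lazily, in parallel where possible. The sketch below illustrates only the partitioning idea using the Python standard library. It is not Dask and not the query used in the study, and the chunk size of one day is an assumption made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterator, List


def calendar_chunks(start: datetime, end: datetime,
                    step: timedelta, chunk_size: int) -> Iterator[List[datetime]]:
    """Yield the timestamps in [start, end) in lists of at most `chunk_size`."""
    chunk: List[datetime] = []
    current = start
    while current < end:
        chunk.append(current)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
        current += step
    if chunk:
        yield chunk


# Assumption: one day of one-minute entries (1 440) per chunk.
total_entries = 0
largest_chunk = 0
for chunk in calendar_chunks(datetime(2021, 1, 1), datetime(2022, 1, 1),
                             timedelta(minutes=1), chunk_size=1440):
    total_entries += len(chunk)  # per-chunk calendar calculations would go here
    largest_chunk = max(largest_chunk, len(chunk))

print(total_entries, largest_chunk)  # 525600 1440
```

Only one chunk is held in memory at a time, so the peak memory requirement depends on the chunk size rather than on the length of the calendar. A library such as Dask additionally schedules the partitions across several workers.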

Following the completion of the performance testing, a selected number of business users were given early access to the template to prepare reports.

Users were also shown how to use the system during training sessions. The users requested a few adjustments to the templates, such as the addition of more columns in the calendar table of the template as well as the inclusion of the financial year for the custom billing dates.

After these adjustments were made, the users tested the updated template for the first time. Once the users were satisfied, the new BI system was finally deployed.

As the system was being used after the initial deployment, several improvement opportunities were discovered. As a result, the implemented BI system was refined. These enhancements are discussed in greater detail in the following subsection.

IMPROVEMENTS

First, the BI system was reliant on Python because Python was chosen as the method of importing data into the BI reports. This meant that the user was required to have a specific version of Python installed on their machine, as well as specific packages. Creating BI reports therefore required a significant amount of pre-installation work.

To address this issue, a cloud Application Programming Interface (API) was created and used to populate the BI reports with data. An API²⁸ is an interface that allows data to be transmitted from one software product to another. It also includes the conditions of this data exchange.

As a result, users no longer needed to have Python installed on their local machine, but only needed internet access. Because no additional tools had to be installed beforehand, the process of developing BI reports was streamlined.

25 Pandas Data Frames. Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

26 Vectorization in Python. Source: https://www.geeksforgeeks.org/vectorization-in-python/

27 Dask. Source: https://dask.org/

28 What is an API? Source: https://www.altexsoft.com/blog/engineering/what-is-api-definition-types-specifications-documentation/

In addition, as shown in Figure 3-5, the user had to enter the report parameters manually in the BI report. If a user needed to change the start date or add a new tag Id, for example, the user had to download the report, change the parameter, and then republish the report, which was time-consuming. A menu was therefore created to allow users to configure datasets online. This menu is depicted in Figure 3-15.

Figure 3-15 Dataset Menu

The user can use this menu to specify the dataset name and the data source, which is a collection of tag Ids. The menu also allows the user to specify calendar options such as the start month of the financial year, the billing calendar, the calendar interval, a start date, and an end date.

Once the report parameters have been specified online, the user only needs to provide the client group and dataset Id in the report, as shown in Figure 3-16.

Figure 3-16 Updated report parameters

The API would then use the dataset Id to retrieve all the report parameters required to pull the correct data. After several iterations of improvements, the BI system was redeployed.
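The lookup described above can be sketched as follows. The field names, the in-memory `DATASETS` store, the example values, and the `resolve_parameters` function are all hypothetical; the actual API and its storage are not described in the text. The fields mirror the options listed for the dataset menu.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DatasetConfig:
    """Options captured by the dataset menu (field names are hypothetical)."""
    name: str
    tag_ids: List[str]
    financial_year_start_month: int
    billing_calendar: str
    interval_minutes: int
    start_date: date
    end_date: date


# Hypothetical in-memory store keyed by (client group, dataset Id).
DATASETS: Dict[Tuple[str, str], DatasetConfig] = {
    ("client-a", "ds-001"): DatasetConfig(
        name="Example dataset",
        tag_ids=["tag-1", "tag-2"],
        financial_year_start_month=3,
        billing_calendar="monthly",
        interval_minutes=30,
        start_date=date(2021, 1, 1),
        end_date=date(2021, 12, 31),
    ),
}


def resolve_parameters(client_group: str, dataset_id: str) -> DatasetConfig:
    """Return the stored configuration, or raise if the pair is unknown."""
    try:
        return DATASETS[(client_group, dataset_id)]
    except KeyError:
        raise LookupError(
            f"no dataset {dataset_id!r} for client group {client_group!r}"
        ) from None


config = resolve_parameters("client-a", "ds-001")
print(config.tag_ids, config.start_date, config.end_date)
```

Because the report carries only the two identifiers, a change to the start date or the tag list is made once in the stored configuration and does not require the report to be downloaded and republished.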

Several training sessions were held following the deployment of the BI system. These training sessions were recorded and saved in XYZ consulting’s common repository, where all employees can access them. There is also a support ticket system in place where users may submit feature requests and report system defects.

In terms of system maintenance, in addition to the refresh icons, several flags were also added to the system to alert the user when something is wrong with the report. Furthermore, the BI system defines access roles in addition to flagging. A report has an owner as well as contributors. The report owner can change the report configurations, assign contributors, link the report to live views, publish a new revision, and delete the report.

Report contributors, on the other hand, can only publish a new revision. The .pbix file can be seen and downloaded by other users.
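These access rules can be expressed as a simple permission table. The sketch below is illustrative only: the role names, the `Action` members, and `is_allowed` are hypothetical, and it assumes that owners and contributors may also view and download the .pbix file, which the text states explicitly only for other users.

```python
from enum import Enum


class Action(Enum):
    CHANGE_CONFIGURATION = "change configuration"
    ASSIGN_CONTRIBUTORS = "assign contributors"
    LINK_TO_LIVE_VIEWS = "link to live views"
    PUBLISH_REVISION = "publish revision"
    DELETE_REPORT = "delete report"
    VIEW_OR_DOWNLOAD_PBIX = "view or download .pbix"


# Role-to-action mapping taken from the description above.
# Assumption: every role may view and download the .pbix file.
PERMISSIONS = {
    "owner": set(Action),
    "contributor": {Action.PUBLISH_REVISION, Action.VIEW_OR_DOWNLOAD_PBIX},
    "other": {Action.VIEW_OR_DOWNLOAD_PBIX},
}


def is_allowed(role: str, action: Action) -> bool:
    """True if `role` may perform `action`; unknown roles get no permissions."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed("owner", Action.DELETE_REPORT))        # True
print(is_allowed("contributor", Action.DELETE_REPORT))  # False
```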

Monitoring systems were also added to monitor the memory and CPU use of the PowerBI gateway, which oversees refreshing the reports with data. Finally, data backups are required to ensure that the system is not jeopardised in the event of a data breach.

As a result, a mechanism exists to produce a daily backup of the data warehouse. This data is also utilised as development data by restoring it into the development database via the restore process. This enables continuous testing of the backup and restore functions.
