Features are important, but so is the integrity of the platform you're using. Performance, availability, and security are all crucial, so we've compiled some key information about the Mews platform.
Mews is a cloud-native, serverless, multi-tenant SaaS platform. This has implications for every aspect of how Mews operates and how it meets requirements and compliance obligations. It’s important to read the other sections of this documentation with this philosophy in mind, because Mews may differ from more traditional systems, and some requirements or questions do not apply to the Mews platform.
From day one, Mews was built for the cloud. The system architecture was designed for cloud deployment and takes full advantage of the benefits this brings. Mews is not a system that was designed to run on-premises and later ported or adjusted for cloud deployment, which means that some processes or procedures work differently or do not happen at all.
There are multiple ways to operate within a cloud environment. On one end of the spectrum, you could use only low-level cloud services such as virtual machines and handle everything else on your own. The advantage is that all cloud providers offer such primitive functionality, so switching between them is rather straightforward – although you need the expertise to configure the servers, databases etc. and to maintain them continually on your own.
On the other end of the spectrum, you could go beyond low-level functionality and use the cloud as a service, e.g. computing or storage services. That way, the cloud provider takes care of the configuration and maintenance – the disadvantage is that you become more locked in to your specific cloud provider.
We are a proud partner of Microsoft and use Microsoft Azure to its fullest potential, which puts us on the serverless side of the spectrum. We use services like Azure App Service for web application hosting, and Azure SQL Database and Azure Storage as storage services. As a result, we don’t operate any virtual machines, web servers or database servers on our own – that is the responsibility of our cloud provider, including the compliance and security of such services.
These services have their own SLAs defined by Azure. We built our solution on top of that in a way that combines their services and SLAs together with our system, to guarantee our SLAs. We use the same technique to guarantee our compliance, security, disaster recovery and other aspects covered in more detail in further sections.
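As a rough illustration of how the SLAs of combined services can be reasoned about, the sketch below computes a composite availability. It assumes independent failures, and the percentages are made-up placeholders, not actual Azure or Mews SLA figures; the function names are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability when any one of `copies` independent replicas is enough."""
    return 1 - (1 - availability) ** copies


# Placeholder figures for illustration only (NOT real SLAs).
app_hosting = 0.9995
database = 0.9999

# A request needs both the application host and the database.
single_region = serial_availability([app_hosting, database])

# Two fully operational regions, either of which can serve traffic.
two_regions = redundant_availability(single_region, copies=2)

print(f"single region: {single_region:.6f}")
print(f"two regions:   {two_regions:.8f}")
```

Chaining services lowers the combined figure, while adding an independent second region raises it, which is why redundancy across regions matters when building on top of provider SLAs.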
There is a single production “installation” of the Mews platform that all of our clients use. That means our clients are always running on the latest version of the platform, with the same features and functionality available to everyone else on the platform (depending on subscription level). From a data perspective, data is not segregated; the storage is shared.
From a security perspective, it is actually very similar to a single-tenant system. There, we’d have to ensure that users with different privilege levels can access only the data they are granted access to. On multi-tenant systems, the tenant can be understood as another “layer” of privileges. Having a multi-tenant solution allows us to effectively implement above-enterprise or above-chain scenarios and deliver great guest experience, especially in the guest portal.
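The idea of the tenant as an additional privilege layer can be sketched as follows. This is a minimal illustration, not the actual Mews authorization model: the `User`, `Record` and `can_access` names, the tenant identifiers and the privilege strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    tenant_ids: frozenset  # tenants (e.g. properties) the user belongs to
    privileges: frozenset  # privileges the user holds within those tenants


@dataclass(frozen=True)
class Record:
    tenant_id: str
    required_privilege: str


def can_access(user: User, record: Record) -> bool:
    # Tenant layer: data is only visible to users of the owning tenant.
    if record.tenant_id not in user.tenant_ids:
        return False
    # Privilege layer: the same check a single-tenant system would perform.
    return record.required_privilege in user.privileges


receptionist = User("receptionist", frozenset({"hotel-a"}),
                    frozenset({"reservations.read"}))
chain_manager = User("chain manager", frozenset({"hotel-a", "hotel-b"}),
                     frozenset({"reservations.read"}))
booking = Record(tenant_id="hotel-b", required_privilege="reservations.read")

print(can_access(receptionist, booking))   # False
print(can_access(chain_manager, booking))  # True
```

Because a user can belong to several tenants in this sketch, an above-chain role is expressed with the same check rather than a separate mechanism.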
The only thing you need to use the Mews platform is the internet and a web browser. We take care of everything else. We handle all the aspects that are covered in this documentation, and we strive to do them as well as possible while continually improving. It is our responsibility to ensure the system is fast, always available, has backups, is secure, complies with all legislation, is always up-to-date, accessible for everybody all over the world and usable for a wide range of users.
We use Microsoft Azure as a cloud provider, and we utilize the following services:
- Azure SQL Database for storage of relational data
- Azure Storage for storage of binary data and system assets
- Azure Cosmos DB for storage of non-relational data
- Azure Cache for Redis (remote dictionary server) as caching storage
- Azure App Service for application hosting
- Azure DNS for domain management
- Azure CDN as a content delivery network for images and other assets
- Azure Traffic Manager for DNS-based load balancing
- Azure Automation for process automation
- Azure Application Insights for telemetry
- Azure Cognitive Services for AI services
Some of these services are global with geo-replication and high-availability built in by Azure; some of the services are bound to a single region. We have two fully operational regions: the primary region in West Europe, and the secondary region in North Europe. Our primary database is a high-availability cluster within the primary data center, with replicas in the secondary region. We have App Services, SQL Databases, Redis caches and Cosmos DB in both data centers, and the rest of the services are shared.
Besides the above, we use other third-party services for various purposes:
- Rapid7 insightOps for logging
- Sentry for error reporting
- NewRelic for performance monitoring
- PagerDuty for incident management
- Firebase for push notifications
- Google Analytics for analyses of user behavior
- Sendgrid as an emailing service provider
- Hotjar for analyses of user behavior
- GitHub as a source control tool
- Azure DevOps as a continuous integration pipeline
- Zapier for system integration
- Browserstack as a testing tool
- Statuspage for public system status information
The Mews platform runs in multiple instances called environments; some of them are public, some of them are private.
Our philosophy when it comes to deployments is to deploy as often as possible and the smaller the deployed change-set, the better. The main reason is that this helps us to deliver finished features and fixes to our clients as soon as possible so that they can benefit from them immediately. The secondary reason is to minimize problems during deployments and simplify investigation and rollback in case of any problems. All of our deployments cause no downtime for the end-user.
Our platform is not a single application that we deploy en bloc, but rather an assembly of systems and applications that work and communicate together and that have their own deployment schedules. There are three main categories of applications with respective deployment schedules:
- Backend platform (server) is deployed at least once every weekday. On top of that, if necessary, ad-hoc deployments can be done for various purposes, e.g. hot fixes. The standard scenario is that all changes (features, bug fixes, improvements) are continuously deployed to the development environment. Once a day, we automatically take a snapshot of the development environment version and deploy it to the demo environment (this is called feature-freeze). The next day, if there were no problems on the demo site, that version is deployed to the production environment. All deployments happen gradually on all instances and regions. During the process, we monitor the system and in case of any issues, we’re able to roll back the process.
- Web applications (e.g. Commander, Distributor or Navigator) are deployed independently whenever a change is finalized in the application. That means whenever a feature is implemented or a bug is fixed and passes quality assurance, it is immediately deployed both to demo and production. This is true continuous delivery, which means there might be 50 or 0 deployments in a day, depending on how many changes are finalized on that day. Again, we monitor the health of the applications and we’re able to roll back any process if necessary.
- Mobile applications are deployed irregularly, due to verification processes in application stores. Once in a while, when we determine that the set of finished changes in the development version of the application is reasonably big, or when necessary for other reasons, we release a new version of the application. It's then published to the respective application store, where it goes through the verification process. After some time (hours or days), the new version reaches end-user devices.
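The daily backend promotion flow described in the list above (continuous deployment to development, a daily feature-freeze snapshot to demo, and promotion to production the next day if demo had no problems) can be sketched as a tiny state machine. The `Pipeline` class, its method names and the version strings are hypothetical and only model the ordering of steps, not real deployment tooling.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    development: str = "v0"
    demo: str = "v0"
    production: str = "v0"

    def merge_change(self, version: str) -> None:
        # Every finished change is deployed to development continuously.
        self.development = version

    def daily_cycle(self, demo_healthy: bool) -> None:
        # Yesterday's frozen version reaches production only if demo
        # showed no problems; otherwise production stays untouched.
        if demo_healthy:
            self.production = self.demo
        # Feature freeze: snapshot the current development version to demo.
        self.demo = self.development


pipeline = Pipeline()
pipeline.merge_change("v1")
pipeline.daily_cycle(demo_healthy=True)   # demo = v1, production = v0
pipeline.merge_change("v2")
pipeline.daily_cycle(demo_healthy=True)   # demo = v2, production = v1
pipeline.merge_change("v3")
pipeline.daily_cycle(demo_healthy=False)  # demo = v3, production stays v1
print(pipeline)
```

The sketch assumes that a failed demo check simply skips that day's production deployment; how problems are actually resolved is not described in the text.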
We reserve the option of scheduled downtime necessary for system changes, although our goal is to never use this option. So far, we have had to use it only once, in 2013, when we migrated our cloud provider from AppHarbor to Microsoft Azure. Since then, we have had no scheduled downtime.
It's important to distinguish between deployment and release. Deployment is the moment when the change reaches production. However, the change does not necessarily need to be available to all clients. The moment when the change is available to a client is called a release. Smaller changes, bug fixes, improvements or other non-critical things are released to all our clients as soon as they are deployed. However, for bigger or more critical changes, we stick to the following four-step release process:
- Internal alpha: The change is released only to Mews employees. We use Mews internally as well, therefore we are the “canaries” who test the change.
- Private beta: The change is released to a selected subset of clients who form early-adopter groups. If the change is particularly important for someone who was involved in the product discovery and delivery process, they might be included in private beta as well. For this step and also for internal alpha, we use LaunchDarkly to manage the set of impacted clients.
- Public beta: The change is released to anybody who opts in to the change. Usually, we introduce an option in settings that allows clients to opt in.
- General availability: The change is released to all our clients.
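The four release steps above can be sketched as a gating function that decides whether a deployed change is visible to a given client. This is a hedged illustration that does not use the LaunchDarkly API; the `Stage` enum, the `is_released_to` function and its parameters are hypothetical, and the sketch assumes that each step also includes the audiences of the earlier steps.

```python
from enum import IntEnum


class Stage(IntEnum):
    INTERNAL_ALPHA = 1
    PRIVATE_BETA = 2
    PUBLIC_BETA = 3
    GENERAL_AVAILABILITY = 4


def is_released_to(stage: Stage, *, is_employee: bool,
                   in_early_adopter_group: bool, opted_in: bool) -> bool:
    """Return True if a change at `stage` is visible to this client."""
    if stage >= Stage.GENERAL_AVAILABILITY:
        return True  # released to all clients
    if is_employee:
        return True  # employees see the change from internal alpha onwards
    if stage >= Stage.PRIVATE_BETA and in_early_adopter_group:
        return True  # selected early-adopter clients
    if stage >= Stage.PUBLIC_BETA and opted_in:
        return True  # anyone who opted in via settings
    return False


print(is_released_to(Stage.PRIVATE_BETA, is_employee=False,
                     in_early_adopter_group=True, opted_in=False))  # True
print(is_released_to(Stage.PRIVATE_BETA, is_employee=False,
                     in_early_adopter_group=False, opted_in=True))  # False
```

Keeping deployment and release separate in this way means the code can already be in production while the gate decides who sees it.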
Because Mews is a cloud-native system, our disaster recovery strategy revolves around data backups and the capability to restore them in case of an incident. All other services are “stateless”, which means that in case of disaster, we are able to restore them without any loss of information. We rely heavily on the features that Microsoft Azure offers in this area, plus we have our own levels of backups built on top of standard Azure features.
Azure SQL database
We use the premium tier of Azure SQL Database with a replica in the secondary geographical region. This setup already has several backup layers and mechanisms out of the box, described in full detail in the Azure documentation. On top of that, we have our own backup processes. All of the options, both built-in and ours, are described below:
- Within a data center, the database service runs as a high-availability cluster of two identical replicas of the database with near real-time data latency (low milliseconds). In case of disaster to the primary database, the service immediately fails over to the secondary database. Alternatively, we are able to trigger this failover manually.
- The database service offers point-in-time restore which enables us to restore a complete database to a particular point in time up to 35 days back.
- The database cluster in the primary region is geo-replicated to the secondary high-availability cluster in the secondary region. In case of disaster in the primary region, we are able to perform failover to the secondary region and promote the secondary replica to master. We can do that fast and reliably using auto-failover groups.
- We perform daily snapshots of the primary database using the point-in-time restore capability to another backup server. The backup server holds two fully restored copies, at most 24 and 48 hours old, ready for immediate usage in case of disaster affecting both the primary and secondary database. Alternatively, these snapshots may be used in case of partial data corruption to restore the data immediately.
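Two of the numbers in the list above, the 35-day point-in-time restore window and the two restored snapshot copies kept on the backup server, can be expressed as a small sketch. The function names are hypothetical and the code models only the time arithmetic, not any Azure API.

```python
from datetime import datetime, timedelta, timezone

# Restore window stated in the text above.
PITR_RETENTION = timedelta(days=35)


def can_point_in_time_restore(target: datetime, now: datetime) -> bool:
    """True if `target` lies within the point-in-time restore window."""
    age = now - target
    return timedelta(0) <= age <= PITR_RETENTION


def rotate_daily_snapshots(kept: list, new_snapshot: datetime) -> list:
    """Keep only the two most recent daily snapshots, newest first."""
    return [new_snapshot, *kept][:2]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(can_point_in_time_restore(now - timedelta(days=10), now))  # True
print(can_point_in_time_restore(now - timedelta(days=36), now))  # False

snapshots = []
for day in (29, 30, 31):
    taken_at = datetime(2024, 1, day, tzinfo=timezone.utc)
    snapshots = rotate_daily_snapshots(snapshots, taken_at)
print([s.day for s in snapshots])  # [31, 30]
```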
As a store for binary data, we use Azure Storage configured to use geo-redundant storage capabilities. The data is automatically replicated three times within the primary region and three times in the secondary region. The storage account also uses a soft-delete feature, which protects against application-specific issues and allows us to recover potentially corrupted data. As with the SQL database, we perform daily incremental backups of all data in the storage into backup storage that is ready for immediate usage in case of disaster affecting the primary storage.
We have Cosmos DB configured to be replicated into multiple regions. Cosmos DB transparently replicates the data to all regions associated with our account, and supports automatic failover in case of regional outage. Currently, we store only non-business-critical data in Cosmos DB (e.g. logs) and therefore we don’t have any additional layer of backups built on top of the features the service offers.
The Mews platform works with very sensitive customer data, therefore security and data privacy are non-negotiable elements of the system. Our general approach in this area is that nothing should rely on people or their knowledge. All our security measures and internal processes are designed that way; for example, while our developers are regularly trained in secure coding best practices, we do not rely solely on this for security. Our processes and frameworks are designed to prevent security bugs from being introduced, or, where it’s not technically possible to prevent them fully, to make it extremely difficult for a developer to introduce security issues into our system. This is reflected in our security issue resolution process, which is described later. Our security strategy is governed by two main principles:
- Minimizing the attack surface, reducing its scope and complexity.
- Continuous penetration testing of the attack surface, with extensive and thorough resolution of any findings.
Besides these proactive measures, we frequently undergo audits, certifications, due diligence processes and pen tests by third-party companies, appointed either by us (e.g. PCI-DSS, ISO) or by our prospective clients.
Minimizing the attack surface
The best way to avoid any security issues is to completely eliminate the possibility of making them in the first place. This aligns with our serverless philosophy: we are not in control of hardware, operating systems, web servers or database servers. We are not able to misconfigure any of these systems, and we are not able to forget to apply security patches etc. – this is the responsibility of Azure, which has large security teams. We use a very limited configuration of the Azure services, for which there are options to turn on some additional security features. To ensure we don't miss any of these, we use Azure Security Advisor, which notifies us about all such options, for example when Azure introduces any new features that could harden the security of our systems. Thanks to all the above, our attack surface (from the system perspective) is effectively reduced to the application code that we develop. For more information about Azure security capabilities, please refer to Azure's security fundamentals documentation.
Continuous penetration testing
As already demonstrated, our primary focus is on application-level security. In order to ensure that our system is secure, we continuously undergo penetration testing by cobalt.io. At any given point in time, a part of our system or a product is being pen tested and we make sure that the whole surface is covered by tests in a continuous fashion.
There are multiple approaches to addressing security vulnerabilities. We take pride in our approach and address every security issue in a post-mortem manner, meaning that we perform detailed root-cause analysis and then solve not only the individual instance but all similar instances in all of our products. On top of this, we put measures in place that prevent such issues from recurring in the future. As an example, if a problem is found in one of our APIs, we update our API framework in a way that eliminates the issue from all of our APIs, or we implement a static code analyzer that automatically checks for the issue both in our existing codebase and in new code that we produce. So even though a single product is being tested at a time, we apply our findings to all of them.
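To make the static code analyzer idea concrete, here is a minimal sketch of such a check using Python's standard `ast` module. It is an illustration of the general technique only: the rule set (`FORBIDDEN_CALLS`) and the function name are hypothetical, and nothing here describes the analyzers, language or rules that Mews actually uses.

```python
import ast

# Hypothetical rule: these built-in calls are considered unsafe.
FORBIDDEN_CALLS = {"eval", "exec"}


def find_forbidden_calls(source: str):
    """Return sorted (line number, function name) pairs for forbidden calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)


sample = "x = eval(user_input)\nprint(x)\nexec('y = 1')\n"
print(find_forbidden_calls(sample))  # [(1, 'eval'), (3, 'exec')]
```

Run against every change in a build pipeline, a check like this catches a whole class of issue automatically instead of relying on reviewers to spot each instance.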
Our approach to certifications is to judge them case by case, on demand. We are not proactive in this area: although some certifications can help us improve certain processes and can provide assurance that what we do is considered best practice, we also see that some certifications have a hard time keeping up with new technologies and modern software development practices. Therefore, we only undertake the certifications that make sense to us or that are an absolute necessity for us. Currently, we have the following certifications:
- PCI-DSS Level 1
- ISO 9001
- ISO 27001