The 5 Internet Operations Factors That Matter


Transcript of The 5 Internet Operations Factors That Matter

Page 1: The 5 Internet Operations Factors That Matter


ChinaNetCloud (云络科技)

Running the World’s Internet Servers

The 5 Operations Factors That Matter

Today’s Internet is a big and dynamic place, with every site, app, and service clamoring for attention, traffic, and sales. These systems are mission-critical and must be up and fast 7x24 to succeed in modern markets. As such, there are five important aspects that matter in system operations. These are Reliability, Performance, Scale, Security, and Cost Savings.

Many companies focus on just one or at most two of these, not realizing how important all five are. Treating Cost Savings as the permanent top priority is especially risky, as it will bite most companies in the long run. Understanding and putting proper emphasis on all five factors is key to 7x24 operational and system success in the 21st century.

With that, here are the five factors that matter most, along with some best practices for improving each.

A site must be up, fast, and available. Simple as that. Customers and users are busy people and use your system 7x24, especially at night and even overnight if your users have geographic diversity. Keeping the system running at nearly any cost is paramount to both economic success, such as selling products, and to projecting a high-quality brand and trust valued by consumers.

You should think about:

Fully redundant hardware and servers - Failure happens, in hardware and especially software. A real highly-available system has full redundancy from the IDC down, with dual IDC feeds, firewalls, switches, physical servers, VMs, load balancers, webservers, databases, caches, and everything else. You need to double down to stay up.

Deep 7x24 Monitoring - Even with redundancy, things happen in your system that need to be proactively avoided or at least rapidly detected and dealt with. Deep monitoring of dozens or hundreds of data points helps find problems before they happen, and also quickly notice serious issues so they can be handled very quickly, maintaining reliability.

Reliability Engineering - Modern systems should have reliability designed in, from the architecture to the hardware to configurations to operations and processes, all designed for maximum uptime and availability. Many simple choices have substantial effects on overall reliability, as do sophisticated tools such as PHP overload detectors, log analyzers, HAProxy, and Keepalived.
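As a rough illustration of why doubling every layer pays off, the sketch below estimates availability for redundant and chained components. It assumes failures are independent, which real systems only approximate, and the function names and the 99% figures are illustrative rather than taken from any ChinaNetCloud tooling.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures.

    The group is down only when every replica is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - component_availability) ** replicas


def serial_availability(layer_availabilities) -> float:
    """Availability of a chain of layers that must all be up at once."""
    result = 1.0
    for availability in layer_availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one server
    paired = parallel_availability(single, replicas=2)
    print(f"single server:  {downtime_hours_per_year(single):.2f} h/year down")
    print(f"redundant pair: {downtime_hours_per_year(paired):.2f} h/year down")
    # A firewall, load balancer, and database in series, each a redundant pair.
    print(f"three-layer stack: {serial_availability([paired] * 3):.6f}")
```

Under these assumptions, a pair of 99% servers reaches 99.99%, while chaining layers multiplies their availabilities together, so a single non-redundant layer can dominate the downtime of the whole stack.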

Reliability

Page 2: The 5 Internet Operations Factors That Matter


Modern websites must be fast, as every second they take to load they lose users. Even a few seconds can mean the difference between success and failure of a business, so every system must be highly-engineered for high-performance at every level, from hardware and networking up through services, systems, applications, page structure, and CDNs.

You should think about:

Good Engineering - Fast systems rely on performance engineering, or the use of the best design, tools, and processes to build the highest-performance systems. Good coding, code reviews, bottleneck analysis, modern techniques, limited feature use, and general app performance engineering go a long way toward building fast systems.

Application Design - Everything is a balance, especially between flexibility to meet modern features and platforms, and making things perform well. Many of today’s tools and frameworks are very flexible and easily changed or re-purposed, but perform poorly at scale. More hardware can help, but only to a limit, so finding the right methods to make things fast for the end user is critical.

Performance Monitoring - Monitoring is a key element of performance, both inside the system at the server and operations levels, and outside from the network and user point of view. Operations monitoring includes the usual CPU, RAM, and Disk I/O, plus many service-specific issues for the web servers, apps and code, databases, and more, with each looking at key variables that really drive and/or retard performance on large systems.

System Profiling - The best systems use performance tools such as New Relic to look deep inside the code to find bottlenecks and areas for optimization. In addition, the best tools take a modular, yet holistic look at the system, from basic code profiling to external calls for key things like databases plus other services such as search, social media, and external dependencies, all of which can have a big impact on overall system performance and user experience.
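One way to make the user-side view of performance concrete is to summarize response times with percentiles rather than averages alone. The sketch below is a minimal, standard-library example; the function names, the sample values, and the choice of the nearest-rank percentile method are assumptions for illustration, and it is not how New Relic or any specific monitoring product computes its figures.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms):
    """Summarize response times given in milliseconds."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical page load times in ms, with one slow outlier.
    page_loads = [120, 80, 95, 2400, 110, 105, 90, 100, 115, 85]
    print(summarize(page_loads))
```

In this made-up sample the median is 100 ms while the mean is 330 ms, because one 2.4-second request pulls the average up. Tracking high percentiles alongside the median shows how slow the worst-served users are, which a mean alone hides.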

Performance

Page 3: The 5 Internet Operations Factors That Matter


Performance is important, but truly successful sites need performance at large scale as they become popular. Many a site or system has died because it couldn’t keep up with demand, load, and success. So in addition to general performance, systems must be engineered to grow on current and future hardware/clouds, with an architecture that can easily be expanded with modern techniques and technologies.

You should think about:

Architecture that supports scale out and up - Scale engineering differs from performance engineering: speed is easy to achieve on small, simple systems, but those designs won’t scale out or up to many subsystems, parallel operations, and so on. This is especially a problem in traditional hard-to-scale areas such as database performance, where read and write scaling have totally different dynamics and architectural solutions. Other scaling issues include session management, data caching, and static asset sharing.

Deep performance monitoring to find areas to improve - Basic monitoring is helpful for system operations, but deeper views of a system's performance are needed to guide scaling efforts. These include broad looks at OS and hardware performance factors and an extended view of database performance across many metrics and factors that affect performance and scaling.

Load testing - Every system is different, and most perform differently under heavy load. Bottlenecks are often in surprising places, and often easy to fix if they can be found in testing, before they impact real users. Proper load testing is quite difficult, though, and a balance is needed between the perfect test and time or resources to run tests, especially on production systems.

Capacity planning process - Full capacity planning, based on load testing, monitoring, and scale analysis helps determine how large a system can get, where the bottlenecks are, and the dynamic headroom available for surges, promotions, and general growth. Planning also creates growth models to connect business goals such as user traffic to flow models for system load and response that can be tuned to match real-world observations as the system scales up.
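The capacity planning step above can be sketched as simple arithmetic that links a traffic forecast to server counts. Everything here is an assumption for illustration: the function names, the compound-growth model, the idea of a target utilization ceiling, and all of the numbers. A real plan would take per-server capacity from load tests and tune the model against monitoring data.

```python
import math


def project_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Project peak requests per second with compound monthly growth."""
    return current_peak_rps * (1.0 + monthly_growth) ** months


def servers_needed(peak_rps: float, per_server_rps: float,
                   target_utilization: float = 0.6, spares: int = 1) -> int:
    """Servers required to serve peak_rps while keeping each server at or
    below target_utilization of its load-tested capacity, plus spares."""
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_server_rps * target_utilization
    return math.ceil(peak_rps / usable_rps) + spares


def headroom(servers: int, per_server_rps: float, peak_rps: float) -> float:
    """Fraction of total load-tested capacity still unused at peak."""
    capacity = servers * per_server_rps
    return (capacity - peak_rps) / capacity


if __name__ == "__main__":
    peak_now = 3000.0    # hypothetical peak observed in monitoring
    per_server = 500.0   # hypothetical per-server limit from load testing
    fleet = servers_needed(peak_now, per_server, target_utilization=0.5)
    print(f"servers needed today: {fleet}")
    print(f"headroom at peak: {headroom(fleet, per_server, peak_now):.0%}")
    future = project_peak(peak_now, monthly_growth=0.10, months=6)
    print(f"servers needed in 6 months: {servers_needed(future, per_server, 0.5)}")
```

With these example inputs, a 3,000 requests-per-second peak on servers load-tested to 500 requests per second needs 13 servers at a 50% utilization ceiling with one spare. The unused half of each server is the dynamic headroom available for surges and promotions.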

Scale

Page 4: The 5 Internet Operations Factors That Matter


Data is valuable. It is often the most valuable thing for many sites and businesses, so protecting it is a top priority for everyone. While security is everyone’s business, it has to be engineered into a system to be truly secure. And it needs to be there at every level, as a chain is only as strong as its weakest link, often somewhere in the technology or code stack. Hackers are experts at exploiting vulnerabilities and penetrating systems, so experts are required to continually protect systems.

You should think about:

Secure architectures and code - A secure system starts with secure thinking which leads to a well-designed system and resulting code that is written with security in mind. This includes best practices, doing things right, and always working with limited permissions and assuming everything can be compromised.

Best practices including development - Doing development right is critical to a secure system, following strict processes and structures for things like prepared SQL to avoid SQL injection attacks. Advanced tools like static code analyzers are very important to high-quality code that is close to defect free. Code reviews and proper 3rd party tool use also greatly contribute.

Segregated users at every level - User separation is critical to keeping privileges separated and to user tracking or audit, to find out what system or person did what and where. Each person or application should have their own user to prevent sharing and to make things clear at runtime when looking at user and process lists. It also helps enforce a security-oriented mindset among developers and operations teams.

Frequent penetration and attack surface testing - A system is only as good as its ability to withstand determined attackers, especially if it’s out on the public Internet. Thus penetration testing, scanners, audits, and frequent security reviews are the best methods to help ensure the system is as secure as reasonably possible.

IPS/Security modules for real-time protection - Even with great design and testing, vulnerabilities will slip through, in part via 3rd party tools, app servers, and services. The last line of defense is often a good intrusion detection system that can find patterns of illicit access, scanning, and break-ins. These systems are difficult to set up, manage, and monitor, but they are the best front-line defensive systems for critical systems.

Defense in depth with firewalls and tools at every level - Every secure system is secure in multiple ways, at multiple levels. Called “Defense in Depth,” this approach allows one or more layers to be breached while still maintaining a reasonable degree of protection for the core assets, such as main databases. Thus, firewalls at all levels, from Internet-facing to internal inter-system to on-host iptables, are a start. Good design and other principles for each and every OS and service also help maintain maximum security.
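The prepared-SQL practice mentioned in the development item above can be illustrated with Python's built-in sqlite3 module, where user input is passed as a bound parameter instead of being concatenated into the query string. The table, column, and function names are hypothetical, and an in-memory SQLite database stands in for whatever database a real system uses.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    conn.commit()
    return conn


def find_user(conn: sqlite3.Connection, username: str):
    """Look up a user with a parameterized query.

    The `?` placeholder makes the driver treat `username` strictly as data,
    so quote characters in the input cannot change the SQL statement.
    """
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_user(conn, "alice"))
    # A classic injection attempt is matched as a literal string and finds nothing.
    print(find_user(conn, "' OR '1'='1"))
```

Had the query been built by string concatenation, the second input would have turned the WHERE clause into a condition that matches every row. With a bound parameter it is simply a username that does not exist.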

Security

Page 5: The 5 Internet Operations Factors That Matter


Anyone can build a perfect system given unlimited resources and money. But that’s not how most companies operate, and managing overall cost is a key element of success, especially for smaller companies. Thus everything must be done well to achieve all these key factors, but with a very efficient cost structure, typically using state-of-the-art but not bleeding-edge technologies, systems, and practices. All of this is needed to achieve these goals at the lowest possible cost.

You should think about:

Optimized configurations at every level - It’s easy to cut money out of a system, but often at the expense of other critical factors such as reliability, performance, scale, and even security. Inappropriately cutting costs can even lead to higher costs later. Overall, a best-practice strategy that focuses on cost management while balancing the other factors is best.

Code profiling and tuning tools to eliminate bottlenecks - The best cost savings come from using less hardware and fewer resources thanks to improved system efficiency. Thus performance and scale-related tools such as New Relic and deep monitoring also help reduce hardware needs and cost as a system scales up.

Leveraging public clouds for flexibility - The world’s public clouds have come a long way in just a few years, providing massive scale at extreme flexibility, with lots of very useful extra features. While not cheap for most workloads, the cloud often provides the best balance of flexibility, features, and total cost of ownership (TCO).

Leveraging private clouds for more savings - Private clouds are often the savings strategy of choice for larger systems, as they are far more cost-effective than public clouds, with the loss of some flexibility. This is especially important for larger applications for which large RAM and CPU requirements make public clouds cost-prohibitive. At sufficient size, they can support a wide variety of architectures and solutions, and often include development, test, and production systems on the same hardware.

Today’s Internet systems are big and powerful, literally powering the world as we know it. Building and managing these systems takes a best-practice combination of engineering, tools, processes, and perspectives that deliver modern systems with the best balance of all the five factors mentioned above. Understanding and putting proper emphasis on these five factors is key to 7x24 operational and system success in the 21st Century.

ChinaNetCloud is the leading Internet and a pioneer in the Operations-as-a-Service (OaaS) industry. Based in Shanghai, with Chinese, Asian, and Global customers, ChinaNetCloud focuses on system and server operations for large-scale Internet-facing systems, especially in E-Commerce, gaming, mobile, big data, advertising, and new media. ChinaNetCloud provides server operations and maintenance services for hundreds of companies with millions of combined end users.

About ChinaNetCloud

www.ChinaNetCloud.com
E-mail: [email protected]
Phone: +86 21-6422-1946
Offices: Shanghai/Beijing/Hong Kong/US

Cost Savings