RPC middleware solution

RPC is built on network communication plus serialization, but that is only the basic implementation. On top of it sit a number of service-governance concerns, which are common to distributed systems in general.

Serialization

When choosing a serialization framework, consider its performance, memory usage, serialized data size, and compatibility.

Service discovery

ZooKeeper

The management side creates the service root directory
The service provider creates an ephemeral node in the provider directory to store the provider's information
The service consumer creates an ephemeral node in the consumer directory to store the consumer's information
The consumer watches the provider directory and is notified when a provider node changes

Eventual consistency

ZooKeeper is a CP system, and its strong consistency becomes a performance problem. For large-scale service discovery, AP (availability with eventual consistency) is the better fit.

For example, after one registry receives a registration message, it can publish the message to a message bus so that it is synchronized to the other registries, or synchronize through scheduled tasks as in Eureka.

With this approach, pay attention to message de-duplication. Each synchronization message should carry a monotonically increasing version number, and the other registries accept only update messages with a higher version number.
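The version check described above can be sketched as follows. This is a minimal in-memory illustration, not a real registry: the `Registry` class, `apply_sync` method, and instance id format are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy in-memory registry replica (hypothetical, for illustration only)."""
    # instance id -> (version, payload)
    entries: dict = field(default_factory=dict)

    def apply_sync(self, instance_id: str, version: int, payload: dict) -> bool:
        """Apply a synchronization message only if its version is newer."""
        current = self.entries.get(instance_id)
        if current is not None and version <= current[0]:
            return False  # duplicate or stale message: ignore it
        self.entries[instance_id] = (version, payload)
        return True


replica = Registry()
first = replica.apply_sync("order-service@10.0.0.1", 1, {"status": "UP"})
newer = replica.apply_sync("order-service@10.0.0.1", 2, {"status": "DOWN"})
stale = replica.apply_sync("order-service@10.0.0.1", 1, {"status": "UP"})
```

Here the third message is a late duplicate of the first, so the replica keeps the state from version 2.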

Push and pull are combined: push is mainly done through callbacks, and pull is client-side polling.

Health monitoring

The instance obtained by service discovery may be unavailable.

A health monitoring mechanism, such as heartbeats, is needed to evaluate each node's state: healthy, sub-healthy (several consecutive heartbeat failures), or unavailable (connection failure). Healthy nodes are preferred when calling the service.

A node may also alternate between success and failure so that the consecutive-failure count never reaches the threshold, which leads to misjudgment. To guard against this, health should be computed on other dimensions as well, for example the number of failures within a time window (such as 30 seconds in Eureka), expressed as an availability rate.
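A minimal sketch of combining the two signals above, consecutive failures and a windowed availability rate. The class name, thresholds, and the choice to report only "healthy" and "sub-healthy" are assumptions for illustration; the "unavailable" state (connection failure) is left out.

```python
import time
from collections import deque


class HealthTracker:
    """Hypothetical tracker: consecutive failures plus availability over a window."""

    def __init__(self, window_seconds=30.0, fail_threshold=3, min_availability=0.8):
        self.window_seconds = window_seconds
        self.fail_threshold = fail_threshold
        self.min_availability = min_availability
        self.samples = deque()  # (timestamp, succeeded)
        self.consecutive_failures = 0

    def record(self, succeeded, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, succeeded))
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()

    def availability(self):
        if not self.samples:
            return 1.0
        ok = sum(1 for _, succeeded in self.samples if succeeded)
        return ok / len(self.samples)

    def state(self):
        if self.consecutive_failures >= self.fail_threshold:
            return "sub-healthy"
        if self.availability() < self.min_availability:
            return "sub-healthy"
        return "healthy"
```

A node that fails every other heartbeat never accumulates three consecutive failures, but its availability over the window is about 50%, so the second check still flags it.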

Health monitoring may also cover the service's own dependencies. For example, the service itself may be up while its Redis is down. In Spring Cloud, the Actuator health endpoint is used to monitor overall health.

Routing strategy

The caller applies filtering logic when selecting service providers. In scenarios such as traffic switching and grayscale (canary) release, routing strategies shift traffic slowly and smoothly from the old application to the new one, and the old application is taken offline once the shift is complete.

Make the selection rules a configuration item so they can be delivered dynamically.

Load balancing

Configure weights to control how much traffic flows to each machine. Adjusting weights only after availability has dropped means the business has already been affected, so it is better to control the weights intelligently in advance.

Using a dedicated load balancing device brings the following problems:

The load balancing device is a single point of failure; if it is clustered, large-scale onlining, offlining, and scaling make operations complicated

Additional load balancing equipment cost

Additional network proxy access

The load balancing strategy is uniform, so different strategies cannot be configured flexibly for different scenarios

So the RPC framework needs to implement load balancing itself.

Commonly used algorithms include round robin, random, hash, and weighted.

Adaptive: an indicator collector gathers each provider's load information in real time (it can be carried on heartbeats) along with call metrics such as latency, and combines them into a score that drives load balancing.

The difference between routing and balancing: routing rules first filter out a batch of candidate providers, and the balancing algorithm then picks one of them for the call.
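The two steps can be sketched as a filter followed by a weighted pick. The provider dictionaries, tag names, and function names below are assumptions for illustration, and weighted random is just one of the algorithms listed above.

```python
import random


def route(providers, rule):
    """Routing step: keep only providers whose tags match every entry in the rule."""
    return [p for p in providers if all(p["tags"].get(k) == v for k, v in rule.items())]


def pick_weighted(providers, rng=random):
    """Balancing step: weighted random choice among the routed providers."""
    weights = [p["weight"] for p in providers]
    return rng.choices(providers, weights=weights, k=1)[0]


providers = [
    {"addr": "10.0.0.1:8080", "tags": {"version": "v1"}, "weight": 10},
    {"addr": "10.0.0.2:8080", "tags": {"version": "v2"}, "weight": 1},
    {"addr": "10.0.0.3:8080", "tags": {"version": "v2"}, "weight": 3},
]
candidates = route(providers, {"version": "v2"})
chosen = pick_weighted(candidates, random.Random(0))
```

Because the rule is plain data, it can be pushed from a configuration center and changed without redeploying the caller.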

Exception retry

RPC goes over the network, so it must retry on exceptions, but the timeout and the number of retries must be controlled. Too long a timeout hangs the business; too short a timeout causes misjudgment and can lead to inconsistent data. When retrying, the node that just failed is generally excluded, and the timeout should be reset for the new attempt.

Note, however, that because of network jitter a retried request may already have been executed, so RPC interfaces are generally required to be idempotent. Only then is retrying safe; otherwise it may cause serious problems. Not every exception can be retried, either. Some exceptions are thrown by the business, carry a specific meaning, and must not be retried. A whitelist can be configured so that only specific exceptions are retried.
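A minimal sketch of the retry rules above: a bounded number of attempts, the failed node excluded, and only whitelisted exceptions retried. `RpcTimeout`, `BusinessError`, and `call_with_retry` are hypothetical names, and taking nodes in list order stands in for a real load balancer.

```python
class RpcTimeout(Exception):
    """Hypothetical transport-level error that is safe to retry."""


class BusinessError(Exception):
    """Hypothetical business error that must not be retried."""


RETRYABLE = (RpcTimeout, ConnectionError)  # whitelist of retryable exceptions


def call_with_retry(nodes, invoke, max_attempts=3):
    """Call invoke(node), retrying whitelisted errors on a different node each time."""
    remaining = list(nodes)
    last_error = None
    for _ in range(max_attempts):
        if not remaining:
            break
        node = remaining.pop(0)  # a node that failed is not tried again
        try:
            # invoke is expected to apply a fresh timeout on every attempt.
            return invoke(node)
        except RETRYABLE as exc:
            last_error = exc
    raise last_error if last_error else RuntimeError("no nodes available")
```

Any exception outside the whitelist, such as `BusinessError`, propagates to the caller immediately without a second attempt.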

Graceful start and stop

When a provider is about to go offline, its callers do not know, which may cause failures.

When the provider goes offline, it deregisters from the registry, and the registry then notifies each consumer so that the offline node is removed. Because of eventual consistency, however, a consumer may not learn in time that the provider has gone offline.

One approach is for the provider to notify all callers actively. The provider holds long-lived connections to its callers, so it only needs to iterate over them and send an offline notification. Problems remain, though. A call and the shutdown notification may cross within a small time window, or network jitter may occur while the notification is being sent, so requests can still arrive during shutdown. Because some objects have already been destroyed at that point, continuing to process such requests may cause errors, so they should not be processed.

In that case a shutdown flag can be set. When a call arrives and the provider finds that it is shutting down, it does not process the call and fails it directly, so the caller retries on another node. To avoid affecting tasks already in progress, keep a counter of executing tasks; after the shutdown flag is set, wait for the counter to drop to zero, or for a bounded delay, before closing.
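The flag-plus-counter idea above can be sketched as follows. `GracefulServer` and its method names are assumptions for illustration; a real framework would hook this into its request pipeline and shutdown hooks.

```python
import threading
import time


class GracefulServer:
    """Hypothetical provider-side wrapper with a shutdown flag and in-flight counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closing = False
        self._in_flight = 0

    def handle(self, request, handler):
        with self._lock:
            if self._closing:
                # Fail fast so the caller can retry on another node.
                raise RuntimeError("server is shutting down")
            self._in_flight += 1
        try:
            return handler(request)
        finally:
            with self._lock:
                self._in_flight -= 1

    def shutdown(self, max_wait=10.0, poll_interval=0.01):
        """Set the flag, then wait for in-flight calls to finish or the deadline to pass."""
        with self._lock:
            self._closing = True
        deadline = time.monotonic() + max_wait
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_interval)
        return False
```

`shutdown` returns `True` when every in-flight call finished before the deadline and `False` when the wait timed out, so the caller can decide whether to force the close.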

At startup, the main thread loads components in order, and other beans may still need to be loaded after the application has registered with the registry. The application is then not fully started, and traffic arriving at that moment may cause errors, so access must be delayed. One way is to defer registration until the application has finished starting. A pre-registration hook can also be provided that simulates requests to warm up the application before registration completes.

Alternatively, a newly started application has no warm caches and no compilation optimizations (such as JIT) yet, so it is slow. A sudden large influx of traffic may cause widespread timeouts, so it can be necessary to let the application take a small share of traffic to warm up before it takes full traffic. The service provider's startup time (or registration time) can be used as a weight in the load balancing algorithm to distribute traffic this way.
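One way to turn startup time into a weight is a linear ramp, sketched below. The function name, the 600-second warm-up period, and the floor of 1 are assumptions chosen for the example, not values from the text.

```python
def warmup_weight(base_weight, uptime_seconds, warmup_seconds=600):
    """Scale a provider's weight linearly with uptime during the warm-up period."""
    if uptime_seconds >= warmup_seconds:
        return base_weight
    scaled = int(base_weight * uptime_seconds / warmup_seconds)
    return max(1, scaled)  # keep at least a trickle of traffic


just_started = warmup_weight(100, 0)
halfway = warmup_weight(100, 300)
warmed_up = warmup_weight(100, 600)
```

A weighted load balancer that calls this function for each provider sends a new node only a small share of requests at first and its full share once the warm-up period has passed.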

When a large number of nodes must be restarted, delayed access and warm-up may reduce availability. The best approach is to restart in batches.

Circuit breaking and rate limiting

When a provider's load is too high, rate limiting is needed to restrict access and relieve the pressure.

Rate limiting can be done per machine or through a rate-limiting service. In the former, each caller decides its own limit from the configured rate-limiting information; for example, with 10 machines, each machine gets the same quota. In the latter, all callers connect to a central rate-limiting service, which decides whether to start limiting. Both have their pros and cons.
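For the single-machine variant, the text does not name an algorithm. A token bucket is one common choice and is used here purely as an assumption; the class and parameter names are hypothetical. With a cluster-wide budget split evenly, each of 10 machines would be configured with one tenth of the total rate.

```python
import time


class TokenBucket:
    """Hypothetical per-machine limiter: refills at a fixed rate up to a capacity."""

    def __init__(self, rate_per_second, capacity, now=None):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The `now` parameter exists only to make the behavior easy to test; in normal use it is omitted and the monotonic clock is read.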

When upper-layer services depend on lower-layer services, a lower-layer failure can make the upper-layer services unavailable and bring down the entire call chain. In that case a circuit-breaking strategy must kick in to protect the upper-layer callers on the chain.
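The protection described above is commonly implemented as a circuit breaker. The sketch below is a simplified version under assumed names and thresholds: it opens after a number of consecutive failures, fails fast while open, and lets a trial call through after a timeout.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after failures -> trial call after timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            raise RuntimeError("circuit open: failing fast")
        # Either closed, or the timeout elapsed and this is a trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the downstream service is not called at all, which gives it time to recover and keeps the caller's threads from piling up on timeouts.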

Traffic isolation

A service provider serves all of its interfaces as a whole. If traffic to one interface surges, it can overwhelm every provider instance and make the other business interfaces unavailable. Traffic isolation is therefore needed: providers are divided into groups, and different consumers obtain different provider instances to call.

This can be implemented in service discovery: discovery requests carry a business-group attribute.

Grouping leaves fewer available nodes for any particular interface. To keep availability high, primary and secondary groups can be configured: when all nodes of the primary group are unavailable, the consumer temporarily borrows nodes from the secondary group. To limit the impact on the secondary group, only a small part of its nodes should be lent out this way.
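The primary/secondary fallback can be sketched as a selection function on the consumer side. The node dictionaries and the `select_nodes` name are assumptions for illustration; limiting how many secondary nodes may be borrowed is left out for brevity.

```python
def select_nodes(nodes, primary_group, secondary_group):
    """Prefer healthy primary-group nodes; borrow secondary ones only as a fallback."""
    healthy = [n for n in nodes if n["healthy"]]
    primary = [n for n in healthy if n["group"] == primary_group]
    if primary:
        return primary
    return [n for n in healthy if n["group"] == secondary_group]


nodes = [
    {"addr": "10.0.0.1:8080", "group": "orders", "healthy": False},
    {"addr": "10.0.0.2:8080", "group": "orders", "healthy": False},
    {"addr": "10.0.0.3:8080", "group": "shared", "healthy": True},
]
borrowed = select_nodes(nodes, "orders", "shared")
```

In this example both nodes of the primary group are down, so the call is served by the node borrowed from the secondary group.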

Asynchronous

Most of RPC's performance is lost to synchronous, time-consuming business logic. Going asynchronous raises throughput and can improve performance significantly.

An RPC request and its response are two independent operations. Typically the request carries a message id, a future is created and returned, and a mapping from id to future is maintained. When the response arrives, the future is looked up through the mapping and the result is set on it. A synchronous RPC simply calls get on the future and blocks until the result is available.
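The id-to-future mapping can be sketched with the standard library's `concurrent.futures.Future`. `PendingCalls`, `send`, `on_response`, and the `transport_send` callback are hypothetical names for this example; timeouts and cleanup of abandoned entries are omitted.

```python
import itertools
import threading
from concurrent.futures import Future


class PendingCalls:
    """Hypothetical caller-side table mapping message ids to futures."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}
        self._lock = threading.Lock()

    def send(self, transport_send, payload):
        """Register a future under a fresh message id, then send the request."""
        future = Future()
        with self._lock:
            message_id = next(self._ids)
            self._pending[message_id] = future
        transport_send(message_id, payload)
        return future

    def on_response(self, message_id, result):
        """Called by the I/O side when a response arrives."""
        with self._lock:
            future = self._pending.pop(message_id, None)
        if future is not None:
            future.set_result(result)
```

A synchronous call is then just `calls.send(...).result(timeout=...)`, while an asynchronous caller can attach a callback with `add_done_callback` instead of blocking.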

When the server receives a message, deserialization and unpacking are performed on the I/O thread, but business logic generally runs in a separate thread pool. When the business thread pool finishes a task, it hands the result back to the I/O thread to send the response. CompletableFuture can be used to make this further asynchronous and improve throughput. The caller can also use callbacks to the same end.

Security

To prevent just anyone from calling an interface, secure identity authentication is required: the caller obtains a key and uses it to authenticate.

Alternatively, restrict interface call permissions by other means.

Timing wheel

When running batches of timed tasks (for example, periodically scanning for timed-out requests), if many tasks are far from due, the scanning thread repeatedly scans a large number of tasks that are not yet ready to run, which wastes CPU.

A timing wheel places tasks with different delays into different time slots. The scanning thread advances around the wheel at a fixed frequency and scans only the tasks in the current slot, which greatly reduces the number of tasks scanned. Its time complexity is lower than that of a priority queue. Netty provides an implementation, HashedWheelTimer.
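A minimal single-level wheel, to make the slot idea concrete. It is driven by explicit `tick()` calls rather than a real timer thread, and the class and method names are assumptions for illustration, not Netty's API. Tasks whose delay exceeds one revolution carry a remaining-rounds counter.

```python
class TimingWheel:
    """Hypothetical single-level wheel; each tick scans only the current slot."""

    def __init__(self, slots=8):
        self.slots = [[] for _ in range(slots)]
        self.current = 0

    def schedule(self, delay_ticks, task):
        """Run task after delay_ticks ticks (delay_ticks >= 1)."""
        if delay_ticks < 1:
            raise ValueError("delay_ticks must be at least 1")
        n = len(self.slots)
        index = (self.current + delay_ticks) % n
        rounds = (delay_ticks - 1) // n  # full revolutions to wait first
        self.slots[index].append([rounds, task])

    def tick(self):
        """Advance one slot and run the tasks in it that are due."""
        self.current = (self.current + 1) % len(self.slots)
        # Detach the slot first so tasks scheduled during this tick are not run early.
        entries, self.slots[self.current] = self.slots[self.current], []
        for entry in entries:
            if entry[0] == 0:
                entry[1]()
            else:
                entry[0] -= 1
                self.slots[self.current].append(entry)
```

Scheduling is a constant-time append to one slot, and each tick touches only the entries in that slot rather than every pending task.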

Traffic playback

Traffic replay, similar to what tcpcopy and nginx offer, supports regression testing, stress testing, and so on.

Simply record all the requests and send them again.

Dynamic grouping

For traffic isolation, grouping needs to be dynamic so that it can respond flexibly to changes in traffic.

This can be achieved by modifying the group data in the registry, or through dynamic configuration.

Generic invocation (calling without the interface)

Sometimes the caller does not have the interface provided by the server, or it needs only a few methods and does not want to depend on all of the provider's interfaces. The provider can expose a generic interface: the caller puts the information needed for the call, such as the method name, into the message it sends, and the server parses the message, performs the call, and returns the result.
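On the server side, the dispatch described above amounts to looking up a service and method by name. The message layout, `SERVICES` table, `OrderService`, and `generic_invoke` below are all assumptions made for the example.

```python
class OrderService:
    """Hypothetical provider-side service used only for this example."""

    def get_order(self, order_id):
        return {"id": order_id, "status": "PAID"}


SERVICES = {"OrderService": OrderService()}


def generic_invoke(message):
    """Dispatch a call described entirely by the message contents."""
    service = SERVICES.get(message["service"])
    if service is None:
        raise LookupError("unknown service: " + message["service"])
    method_name = message["method"]
    method = getattr(service, method_name, None)
    # Refuse private attributes and anything that is not callable.
    if method_name.startswith("_") or not callable(method):
        raise LookupError("unknown method: " + method_name)
    return method(*message.get("args", []))


result = generic_invoke({"service": "OrderService", "method": "get_order", "args": [42]})
```

The caller needs no stub or interface class, only agreement on the service name, method name, and argument encoding.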
