Category Archives: System

Linux platform system configuration and setup

Spark 2.0 Technical Preview: Easier, Faster, and Smarter

For the past few months, we have been busy working on the next major release of the big data open source software we love: Apache Spark 2.0. Since Spark 1.0 came out two years ago, we have heard praises and complaints. Spark 2.0 builds on what we have learned in the past two years, doubling down on what users love and improving on what users lament. While this blog summarizes the three major thrusts and themes—easier, faster, and smarter—that comprise Spark 2.0, the themes highlighted here deserve deep-dive discussions that we will follow up with in-depth blogs in the next few weeks.

Before we dive in, we are happy to announce the availability of the Apache Spark 2.0 technical preview in Databricks Community Edition today. This preview package is built using the upstream branch-2.0. Using the preview package is as simple as selecting the “2.0 (branch preview)” version when launching a cluster:

[Screenshot: creating a new Apache Spark 2.0 Technical Preview cluster in Databricks]

While the final Apache Spark 2.0 release is still a few weeks away, this technical preview is intended to provide early access to the features in Spark 2.0 based on the upstream codebase. This way, you can satisfy your curiosity and try the shiny new toy while we collect feedback and bug reports ahead of the final release.

Now, let’s take a look at the new developments.

Spark 2.0: Easier, Faster, Smarter

Easier: SQL and Streamlined APIs

One thing we are proud of in Spark is creating APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, with a focus on two areas: (1) standard SQL support and (2) unifying the DataFrame/Dataset API.

On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries. Spark 2.0 can run all 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces to Spark applications, these extended SQL capabilities drastically reduce the effort of porting legacy applications over to Spark.
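To make the new subquery support concrete, here is a small, self-contained Scala sketch (the table names and sample rows below are invented for illustration; they are not from TPC-DS or the original post):

import org.apache.spark.sql.SparkSession

// A minimal sketch of the new subquery support; tables and values are made up.
object SubquerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("subquery sketch").getOrCreate()
    import spark.implicits._

    Seq((1, "alice"), (2, "bob")).toDF("id", "name")
      .createOrReplaceTempView("customer")
    Seq((1, 120.0), (2, 35.0)).toDF("customer_id", "total")
      .createOrReplaceTempView("orders")

    // An IN subquery in the WHERE clause, parsed by the new ANSI SQL parser.
    spark.sql("""
      SELECT name FROM customer
      WHERE id IN (SELECT customer_id FROM orders WHERE total > 100)
    """).show()

    spark.stop()
  }
}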

On the programming API side, we have streamlined the APIs:

  • Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. This new combined Dataset interface is also the abstraction used for Structured Streaming. Since compile-time type-safety is not a language feature in Python and R, the concept of Dataset does not apply to these languages’ APIs. Instead, DataFrame remains the primary programming abstraction, analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook, and see the sketch after this list.
  • SparkSession: a new entry point that replaces the old SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion for Spark is which “context” to use. Now you can use SparkSession, which subsumes both, as a single entry point, as demonstrated in this notebook. Note that the old SQLContext and HiveContext are still kept for backward compatibility.
  • Simpler, more performant Accumulator API: We have designed a new Accumulator API that has a simpler type hierarchy and supports specialization for primitive types. The old Accumulator API has been deprecated but retained for backward compatibility.
  • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
  • Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
  • Distributed algorithms in R: Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.
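To make these API changes concrete, here is a minimal Scala sketch that uses SparkSession as the single entry point, treats a DataFrame as a Dataset[Row], and creates one of the new primitive-specialized accumulators. The people.json path, the Person case class and the accumulator name are assumptions made up for this example:

import org.apache.spark.sql.SparkSession

// Hypothetical schema used only for this sketch.
case class Person(name: String, age: Long)

object SparkTwoApiSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession subsumes the old SQLContext and HiveContext.
    val spark = SparkSession.builder()
      .appName("Spark 2.0 API sketch")
      .getOrCreate()
    import spark.implicits._

    // In Spark 2.0 a DataFrame is simply Dataset[Row].
    val df = spark.read.json("people.json") // assumed sample file

    // Typed (filter on Person) and untyped (groupBy on a column) methods
    // are available on the same Dataset class.
    val adultsByAge = df.as[Person]
      .filter(_.age >= 18)
      .groupBy($"age")
      .count()
    adultsByAge.show()

    // The new accumulator API, specialized for primitive types.
    val badRecords = spark.sparkContext.longAccumulator("badRecords")
    df.as[Person].foreach(p => if (p.name == null) badRecords.add(1))
    println(s"records with a missing name: ${badRecords.value}")

    spark.stop()
  }
}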

Faster: Spark as a Compiler

According to our 2015 Spark Survey, 91% of users consider performance as the most important aspect of Spark. As a result, performance optimizations have always been a focus in our Spark development. Before we started planning for Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make Spark 10X faster?

This question led us to fundamentally rethink the way we build Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), the majority of CPU cycles are spent on useless work, such as making virtual function calls or reading/writing intermediate data to CPU cache or memory. Optimizing performance by reducing the CPU cycles wasted on this useless work has been a long-time focus of modern compilers.

Spark 2.0 ships with the second generation Tungsten engine. This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing. The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique “whole-stage code generation.”

To give you a teaser, we have measured the amount of time (in nanoseconds) it takes to process a row on one core for some of the operators in Spark 1.6 vs. Spark 2.0, and the table below demonstrates the power of the new Tungsten engine. Spark 1.6 uses an expression code generation technique that is also found in some state-of-the-art commercial databases today. As you can see, many of the core operators become an order of magnitude faster with whole-stage code generation.

You can see the power of whole-stage code generation in action in this notebook, in which we perform aggregations and joins on 1 billion records on a single machine.
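To get a feel for what that notebook exercises (this is a rough sketch of the same idea, not the notebook's exact code), summing a billion generated rows goes through the whole-stage generated code path, and explain() shows which operators run inside it:

import org.apache.spark.sql.SparkSession

// A minimal sketch of a "1 billion records on a single machine" style run.
object WholeStageCodegenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("whole-stage codegen sketch")
      .getOrCreate()

    // spark.range produces rows from a simple generator, so the benchmark
    // measures the execution engine rather than I/O.
    val billion = spark.range(1000L * 1000 * 1000)
    val query = billion.selectExpr("sum(id)")

    val t0 = System.nanoTime()
    query.show()
    println(s"sum over 1B rows took ${(System.nanoTime() - t0) / 1e9} seconds")

    // Operators that participate in whole-stage code generation are
    // prefixed with '*' in the physical plan.
    query.explain()

    spark.stop()
  }
}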

cost per row (single thread)

primitive               Spark 1.6   Spark 2.0
filter                  15 ns       1.1 ns
sum w/o group           14 ns       0.9 ns
sum w/ group            79 ns       10.7 ns
hash join               115 ns      4.0 ns
sort (8-bit entropy)    620 ns      5.3 ns
sort (64-bit entropy)   620 ns      40 ns
sort-merge join         750 ns      700 ns

How does this new engine work on end-to-end queries? We did some preliminary analysis using TPC-DS queries to compare Spark 1.6 and Spark 2.0:

[Chart: preliminary TPC-DS query benchmark, Spark 2.0 vs. Spark 1.6]

Beyond whole-stage code generation, a lot of work has also gone into improving the Catalyst optimizer for general query optimizations such as nullability propagation, as well as into a new vectorized Parquet decoder that has improved Parquet scan throughput by 3X.

Smarter: Structured Streaming

Spark Streaming has long led the big data space as one of the first attempts at unifying batch and streaming computation. The first streaming API, called DStream and introduced in Spark 0.7, offered developers several powerful properties: exactly-once semantics, fault tolerance at scale, and high throughput.

However, after working with hundreds of real-world deployments of Spark Streaming, we found that applications that need to make decisions in real-time often require more than just a streaming engine. They require deep integration of the batch stack and the streaming stack, integration with external storage systems, as well as the ability to cope with changes in business logic. As a result, enterprises want more than just a streaming engine; instead they need a full stack that enables them to develop end-to-end “continuous applications.”

One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.

A number of problems exist with this single model. First, operating on data as it arrives can be very difficult and restrictive. Second, varying data distributions, changing business logic, and delayed data all add unique challenges. And third, most existing systems, such as MySQL or Amazon S3, do not behave like a stream, and many algorithms (including most off-the-shelf machine learning) do not work in a streaming setting.

Spark 2.0’s Structured Streaming API is a novel way to approach streaming. It stems from the realization that the simplest way to compute answers on streams of data is to not have to reason about the fact that it is a stream. This realization came from our experience with programmers who already know how to program static data sets (aka batch) using Spark’s powerful DataFrame/Dataset API. The vision of Structured Streaming is to utilize the Catalyst optimizer to discover when it is possible to transparently turn a static program into an incremental execution that works on dynamic, infinite data (aka a stream). When data is viewed through this structured lens, as either a discrete table or an infinite table, streaming becomes much simpler.

As the first step towards realizing this vision, Spark 2.0 ships with an initial version of the Structured Streaming API, a (surprisingly small!) extension to the DataFrame/Dataset API. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of the Spark batch API to answer new questions in real time. Key features will include support for event-time based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
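To show how small the extension is, here is a sketch of a streaming word count in Scala; the socket source on localhost:9999 and the console sink are assumptions chosen for illustration, and the DataFrame/Dataset operations in the middle are the same ones you would write against a static Dataset:

import org.apache.spark.sql.SparkSession

// A minimal Structured Streaming sketch: word count over an assumed socket source.
object StructuredWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured streaming word count")
      .getOrCreate()
    import spark.implicits._

    // Lines arriving on the socket are exposed as an unbounded DataFrame.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same batch-style DataFrame/Dataset operations.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously maintain the running counts and print them to the console.
    val streamingQuery = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    streamingQuery.awaitTermination()
  }
}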

Streaming is clearly a pretty broad topic, so stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap for the near future.

Conclusion

Spark users initially came to Spark for its ease of use and performance. Spark 2.0 doubles down on both while extending Spark to support an even wider range of workloads. We hope you will enjoy the work we have put in, and we look forward to your feedback.

Of course, until the upstream Apache Spark 2.0 release is finalized, we do not recommend fully migrating any production workload onto this preview package. The new package is available on Databricks Community Edition today, and we will be rolling it out to all Databricks customers over the next few days. To get access to Databricks Community Edition,

Micro web framework for low-resource systems: ESP8266

The ESP8266 has 64 KB of DRAM, 96 KB of SRAM, and a 32-bit RISC CPU running at 80 MHz, and it also has to handle the wireless traffic. Despite this, it is still able to dispatch more than 50 requests/s on a Hello World page. And serious optimization hasn’t started yet.

Ref blog: http://www.ureq.solusipse.net/

Ref git: https://github.com/solusipse/ureq

Google’s Belgium data center hit by lightning, resulting in data loss

Google loses data as lightning strikes

From Thursday 13 August 2015 to Monday 17 August 2015, Google’s data center in Belgium was struck by lightning four times. The preliminary cause was that the lightning damaged the power supply system, which led to data loss for some users of the GCE service (Google Compute Engine). The sudden power loss meant that GCE virtual machines could not write their data back in time. It appears that the data center’s UPS system completed its failover, but data loss still occurred.

The affected data amounted to only 0.000001% of total storage space, but considering the massive amount of data in Google’s data centers, the overall scope of the impact was still relatively high.

According to Google’s public report, during the incident window 5% of write-back I/O operations encountered at least one read/write error.

Construction of Google’s Belgium data center began in 2007 at a cost of 250 million euros, and it officially went into service in 2010. The data center is environmentally friendly and highly energy efficient, using water cooling (an advanced evaporative cooling system). It created 3,900 jobs and is estimated to bring the Belgian government 2.2 billion euros in revenue.

See Google’s original incident statement; it is well worth studying for domestic industry peers.

In an online statement, Google said

PHP-FPM Configuration Tips and Tricks

PHP-FPM Tip 1. – PHP-FPM Configuration files
Normally the PHP-FPM configuration is located in the /etc/php-fpm.conf file and the /etc/php-fpm.d path. This is normally an excellent start, and all pool configs go into the /etc/php-fpm.d directory. You need to have the following include line in your php-fpm.conf file:
include=/etc/php-fpm.d/*.conf
PHP-FPM Tip 2. – PHP-FPM Global Configuration Tweaks
Set up emergency_restart_threshold, emergency_restart_interval and process_control_timeout. The default values for these options are totally off, but I think it’s better to use them, for example like the following:
emergency_restart_threshold = 10
emergency_restart_interval = 1m
process_control_timeout = 10s
What does this mean? If 10 PHP-FPM child processes exit with SIGSEGV or SIGBUS within 1 minute, then PHP-FPM restarts automatically. This configuration also sets a 10-second time limit for child processes to wait for a reaction to signals from the master.
PHP-FPM Tip 3. – PHP-FPM Pools Configuration
With PHP-FPM it’s possible to use different pools for different sites, allocate resources very precisely, and even use different users and groups for every pool. The following is an example configuration file structure for PHP-FPM pools for three different sites (or actually three different parts of the same site):
/etc/php-fpm.d/site.conf
/etc/php-fpm.d/blog.conf
/etc/php-fpm.d/forums.conf
Here are example configurations for each pool:
/etc/php-fpm.d/site.conf
[site]
listen = 127.0.0.1:9000
user = site
group = site
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/slowlog-site.log
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 5
pm.start_servers = 3
pm.min_spare_servers = 2
pm.max_spare_servers = 4
pm.max_requests = 200
listen.backlog = -1
pm.status_path = /status
request_terminate_timeout = 120s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
env[HOSTNAME] = $HOSTNAME
env[TMP] = /tmp
env[TMPDIR] = /tmp
env[TEMP] = /tmp
/etc/php-fpm.d/blog.conf
[blog]
listen = 127.0.0.1:9001
user = blog
group = blog
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/slowlog-blog.log
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 4
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 3
pm.max_requests = 200
listen.backlog = -1
pm.status_path = /status
request_terminate_timeout = 120s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
env[HOSTNAME] = $HOSTNAME
env[TMP] = /tmp
env[TMPDIR] = /tmp
env[TEMP] = /tmp
/etc/php-fpm.d/forums.conf
[forums]
listen = 127.0.0.1:9002
user = forums
group = forums
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/slowlog-forums.log
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 10
pm.start_servers = 3
pm.min_spare_servers = 2
pm.max_spare_servers = 4
pm.max_requests = 400
listen.backlog = -1
pm.status_path = /status
request_terminate_timeout = 120s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
env[HOSTNAME] = $HOSTNAME
env[TMP] = /tmp
env[TMPDIR] = /tmp
env[TEMP] = /tmp
So this is just an example of how to configure multiple pools of different sizes.
PHP-FPM Tip 4. – PHP-FPM Pool Process Manager (pm) Configuration
The best way to use the PHP-FPM process manager is dynamic process management, so that PHP-FPM processes are started only when needed. This is much the same style of setup as Nginx’s worker_processes and worker_connections: very high values do not necessarily mean anything good. Every process eats memory, and of course if a site has very high traffic and the server has lots of memory, then higher values are the right choice; but on servers like VPSes (Virtual Private Servers), memory is normally limited to 256 MB, 512 MB, or 1024 MB. This small amount of RAM is enough to handle even very high traffic (dozens of requests per second) if it’s used wisely.
It’s good to test how many PHP-FPM processes a server can handle comfortably. First start Nginx and PHP-FPM and load some PHP pages, preferably the heaviest ones. Then check the memory usage per PHP-FPM process, for example with the Linux top or htop command. Let’s assume that the server has 512 MB of memory, that 220 MB of it can be used for PHP-FPM, and that every process uses 24 MB of RAM (a huge content management system with plugins can easily use 20-40 MB per PHP page request or even more). Then simply calculate the server’s max_children value:
220 / 24 = 9.17
So a good pm.max_children value is 9. This is based on just a quick average, and it may turn out to be something else once you have watched memory usage per process over a longer time. After this quick test it is much easier to set up the pm.start_servers, pm.min_spare_servers and pm.max_spare_servers values.
The final example configuration could be the following:
pm.max_children = 9
pm.start_servers = 3
pm.min_spare_servers = 2
pm.max_spare_servers = 4
pm.max_requests = 200
The maximum number of requests per process is unlimited by default, but it’s good to set some low value, like 200, to avoid memory issues. Even though the numbers seem small, this style of setup can handle a large number of requests.

Nginx Configuration and Optimizing Tips and Tricks

Nginx Tip 1. – Organize Nginx Configuration Files
Normally Nginx configuration files are located under the /etc/nginx path.
One good way to organize the configuration files is to use a Debian/Ubuntu Apache-style setup:
## Main configuration file ##
/etc/nginx/nginx.conf

## Virtualhost configuration files on ##
/etc/nginx/sites-available/
/etc/nginx/sites-enabled/

## Other config files on (if needed) ##
/etc/nginx/conf.d/
Virtualhost files have two paths because the sites-available directory can contain anything: test configs, freshly copied or created configs, old configs, and so on. sites-enabled contains only the configurations that are actually enabled, which are really just symbolic links to files in the sites-available directory.
Remember to add the following includes at the end of your nginx.conf file:
## Load virtual host conf files. ##
include /etc/nginx/sites-enabled/*;

## Load another configs from conf.d/ ##
include /etc/nginx/conf.d/*;
Nginx Tip 2. – Determine Nginx worker_processes and worker_connections
The default setup is okay for worker_processes and worker_connections, but these values can be optimized a little:
max_clients = worker_processes * worker_connections
Even a basic Nginx setup can handle hundreds of concurrent connections:
worker_processes 1;
worker_connections 1024;
Normally 1000 concurrent connections per server is good, but sometimes other parts of the server, such as its disks, might be slow, which causes Nginx to block on I/O operations. To avoid blocking, use for example the following setup: one worker_process per processor core, like:
Worker Processes
worker_processes [number of processor cores];
To check how many processor cores you have, run the following command:
cat /proc/cpuinfo |grep processor
processor : 0
processor : 1
processor : 2
processor : 3
So here there are 4 cores, and the final worker_processes setup could be the following:
worker_processes 4;
Worker Connections
Personally I stick with 1024 worker connections, because I don’t have any reason to raise this value. But if, for example, 4096 concurrent connections is not enough, then it’s possible to double this and set 2048 connections per process.
The final worker_connections setup could be the following:
worker_connections 1024;
I have seen some configurations where server admins, too used to Apache, think that if they set Nginx worker_processes to 50 and worker_connections to 20000, then their server can handle all the traffic it gets in a month at once... but no, it’s not true. It just wastes resources and might cause some serious problems...
Nginx Tip 3. – Hide Nginx Server Tokens / Hide Nginx version number
For security reasons it is good to hide server tokens, that is, hide the Nginx version number, especially if you are running an outdated version of Nginx. This is very easy to do: just set server_tokens off under the http/server/location section, like:
server_tokens off;
Nginx Tip 4. – Nginx Request / Upload Max Body Size (client_max_body_size)
If you want to allow users to upload something, or to upload something yourself over HTTP, then you may need to increase the allowed post size. This can be done with the client_max_body_size directive, which goes under the http/server/location section. By default it is 1 MB, but it can be set, for example, to 20 MB, and the buffer size can be increased as well with the following configuration:
client_max_body_size 20m;
client_body_buffer_size 128k;
If you get the following error, then you know that client_max_body_size is too low:
“Request Entity Too Large” (413)
Nginx Tip 5. – Nginx Cache Control for Static Files (Browser Cache Control Directives)
Browser caching is important if you want to save resources and bandwidth. It’s easy to set up with Nginx; the following is a very basic setup where logging (access log and not-found log) is turned off and expires headers are set to 360 days.
location ~* \.(jpg|jpeg|gif|png|css|js|ico|xml)$ {
access_log off;
log_not_found off;
expires 360d;
}
If you want more complicated headers, or different expiration times by file type, then you can configure those separately.
Nginx Tip 6. – Nginx Pass PHP requests to PHP-FPM
Here you can use the default TCP/IP stack or connect directly over a Unix socket. You also have to set PHP-FPM to listen on exactly the same ip:port or Unix socket (with a Unix socket, the socket permissions also have to be right). The default setup uses ip:port (127.0.0.1:9000); you can of course change which IPs and ports PHP-FPM listens on. Here is a very basic configuration, with a Unix socket example commented out:
# Pass PHP scripts to PHP-FPM
location ~* \.php$ {
fastcgi_index index.php;
fastcgi_pass 127.0.0.1:9000;
#fastcgi_pass unix:/var/run/php-fpm/php-fpm.sock;
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
fastcgi_param SCRIPT_NAME $fastcgi_script_name;
}
It’s also possible to run PHP-FPM on one server and Nginx on another.
Nginx Tip 7. – Prevent (deny) Access to Hidden Files with Nginx
It’s very common for the server root or other public directories to contain hidden files, which start with a dot (.) and are normally not intended for site users. Public directories can contain version control files and directories, like .svn, some IDE properties files, and .htaccess files. The following denies access to all hidden files and turns off logging for them.
location ~ /\. {
access_log off;
log_not_found off;
deny all;
}