Category Archives: Java

Discussing Java technology, Java design patterns, and large-scale system architecture

Updating Parts of Documents (Partial Updates of Solr Index Documents in Practice)

Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports two approaches to updating documents that have only partially changed.

The first is atomic updates. This approach allows changing only one or more fields of a document without having to re-index the entire document.

The second approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.

Atomic Updates and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.

Atomic Updates

Solr supports several modifiers that atomically update values of a document. This allows updating only specific fields, which can help speed indexing processes in an environment where speed of index additions is critical to the application.

To use atomic updates, add a modifier to the field that needs to be updated. The content can be updated, added to, or, if it is a number, incremented by a given amount.

Modifier and usage:

  • set: Set or replace the field value(s) with the specified value(s), or remove the values if ‘null’ or an empty list is specified as the new value. May be specified as a single value, or as a list for multivalued fields.

  • add: Adds the specified values to a multivalued field. May be specified as a single value, or as a list.

  • remove: Removes (all occurrences of) the specified values from a multivalued field. May be specified as a single value, or as a list.

  • removeregex: Removes all occurrences of the specified regex from a multivalued field. May be specified as a single value, or as a list.

  • inc: Increments a numeric value by a specific amount. Must be specified as a single numeric value.


All original source fields must be stored for field modifiers to work correctly, which is the Solr default.

For example, if the following document exists in our collection:

{"id":"mydoc",
 "price":10,
 "popularity":42,
 "categories":["kids"],
 "promo_ids":["a123x"],
 "tags":["free_to_try","buy_now","clearance","on_sale"]
}

And we apply the following update command:

{"id":"mydoc",
 "price":{"set":99},
 "popularity":{"inc":20},
 "categories":{"add":["toys","games"]},
 "promo_ids":{"remove":"a123x"},
 "tags":{"remove":["free_to_try","on_sale"]}
}

The resulting document in our collection will be:

{"id":"mydoc",
 "price":99,
 "popularity":62,
 "categories":["kids","toys","games"],
 "tags":["buy_now","clearance"]
}
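
To send an update command like the one above from Java code, a SolrJ client can pass a map from modifier to value as the field value. The following is a minimal illustrative sketch, not part of the original article: the URL and collection name are placeholders, and it assumes a reasonably recent SolrJ client (older, Solr 4-era clients use HttpSolrServer instead of HttpSolrClient).

import java.util.Arrays;
import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection name, for illustration only
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "mydoc");
        // Each atomic operation is expressed as a map from modifier name to value
        doc.addField("price", Collections.singletonMap("set", 99));
        doc.addField("popularity", Collections.singletonMap("inc", 20));
        doc.addField("categories", Collections.singletonMap("add", Arrays.asList("toys", "games")));
        doc.addField("promo_ids", Collections.singletonMap("remove", "a123x"));
        doc.addField("tags", Collections.singletonMap("remove", Arrays.asList("free_to_try", "on_sale")));

        client.add(doc);   // only the listed fields are modified; other stored fields are preserved
        client.commit();
        client.close();
    }
}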

Optimistic Concurrency

Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace documents to ensure that the document they are replacing/updating has not been concurrently modified by another client application. This feature works by requiring a _version_ field on all documents in the index, and comparing that to a _version_ specified as part of the update command. By default, Solr’s schema.xml includes a _version_ field, and this field is automatically added to each new document.

In general, using optimistic concurrency involves the following work flow:

  1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have the latest version.
  2. A client changes the document locally.
  3. The client resubmits the changed document to Solr, for example with the /update handler.
  4. If there is a version conflict (HTTP error code 409), the client starts the process over.

When the client resubmits a changed document to Solr, the _version_ can be included with the update to invoke optimistic concurrency control. Specific semantics are used to define when the document should be updated or when to report a conflict.

  • If the content in the _version_ field is greater than ‘1’ (e.g., ‘12345’), then the _version_ in the document must match the _version_ in the index.
  • If the content in the _version_ field is equal to ‘1’, then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
  • If the content in the _version_ field is less than ‘0’ (e.g., ‘-1’), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
  • If the content in the _version_ field is equal to ‘0’, then it doesn’t matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.
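
As a rough SolrJ sketch of the read-modify-write cycle described above (illustrative only: the URL, collection, and field values are placeholders, it assumes a reasonably recent SolrJ client, and the exact exception type surfaced for a 409 can vary by client version):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticConcurrencyExample {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
        try {
            // 1. Read the latest version of the document via the real-time /get handler
            SolrDocument current = client.getById("mydoc");
            long version = (Long) current.getFieldValue("_version_");

            // 2. + 3. Change the document locally and resubmit it with the expected _version_
            SolrInputDocument updated = new SolrInputDocument();
            updated.addField("id", "mydoc");
            updated.addField("price", 120);
            updated.addField("_version_", version); // must match the version in the index

            client.add(updated);
            client.commit();
        } catch (SolrException e) {
            if (e.code() == 409) {
                // 4. Version conflict: another client updated the document first; re-read and retry
                System.out.println("Version conflict, starting over...");
            } else {
                throw e;
            }
        } finally {
            client.close();
        }
    }
}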

If the document being updated does not include the _version_ field, and atomic updates are not being used, the document will be treated by normal Solr rules, which is usually to discard the previous version.

For more information, please also see Yonik Seeley’s presentation on NoSQL features in Solr 4 from Apache Lucene EuroCon 2012.

Power Tip


The _version_ field is by default stored in the inverted index (indexed="true"). However, for some systems with a very large number of documents, the increase in FieldCache memory requirements may be too costly. A solution can be to declare the _version_ field as DocValues:

Sample field definition
<field name="_version_" type="long" indexed="false" stored="true" required="true" docValues="true"/>

 

Document Centric Versioning Constraints

Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned, globally unique values for the _version_ field. However, in some situations users may want to configure their own document-specific version field, where the version values are assigned on a per-document basis by an external system, and have Solr reject updates that attempt to replace a document with an “older” version. In situations like this the DocBasedVersionConstraintsProcessorFactory can be useful.

The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml as part of the UpdateRequestProcessorChain and specify the name of the versionField in your schema that should be checked when validating updates:

<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
  <str name="versionField">my_version_l</str>
</processor>

Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing document where the value of the my_version_l field in the “new” document is not greater than the value of that field in the existing document.

DocBasedVersionConstraintsProcessorFactory supports two additional configuration parameters, both optional:

  • ignoreOldUpdates – A boolean option which defaults to false. If set to true then instead of rejecting updates where the versionField is too low, the update will be silently ignored (and return a status 200 to the client).
  • deleteVersionParam – A String parameter that can be specified to indicate that this processor should also inspect Delete By Id commands. The value of this configuration option should be the name of a request parameter that the processor will now consider mandatory for all attempts to Delete By Id, and that must be used by clients to specify a value for the versionField which is greater than the existing value of the document to be deleted. When using this request param, any Delete By Id command with a high enough document version number to succeed will be internally converted into an Add Document command that replaces the existing document with a new one which is empty except for the Unique Key and versionField, keeping a record of the deleted version so future Add Document commands will fail if their “new” version is not high enough.
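
For illustration, if deleteVersionParam had been configured with the (hypothetical) name del_version, a delete request carrying the external version might look roughly like this in SolrJ. This is a sketch under those assumptions, not an excerpt from the Solr documentation:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class VersionedDeleteExample {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        UpdateRequest req = new UpdateRequest();
        req.deleteById("mydoc");
        // "del_version" is the hypothetical deleteVersionParam name; its value must be
        // greater than the existing my_version_l value of the document being deleted
        req.setParam("del_version", "101");
        req.process(client);

        client.commit();
        client.close();
    }
}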

Please consult the processor javadocs and test configs for additional information and example usages.

Sharing My Tech Talk Slides from Sina: Object Orientation and Design Patterns

I have been knocking around Internet development for almost 10 years; I have piled up plenty of failures and not so many successes. Whenever I have some spare time, I feel like sharing some of my experience of writing code. That experience has become my own understanding, most of it absorbed from the open-source community, and sharing it all here is a way of giving something back to the developer community. This document is mainly about my views on object orientation and design patterns; limited by my own understanding, some of it is bound to be wrong, and blunt criticism and corrections are very welcome.
Click to download: Object Orientation and Design Patterns

NoSQL Steps Off Its Pedestal: Why Twitter Suspended Its Use of Cassandra

This is an example of a failed attempt at innovation.
Adopting mature, stable technology is still the best choice, even for a top-10 company. We should stay rational and level-headed about new technology.
In its official engineering blog post "Cassandra at Twitter Today" of July 9, Twitter mentioned it was putting on hold the plan to replace MySQL with Cassandra for storing feeds. This is an important adjustment of Twitter's architecture strategy, because Twitter had until then been an industry leader in the Cassandra camp.

For now, we’re not working on using Cassandra as a store for Tweets. This is a change in strategy. Instead we’re going to continue to maintain our existing Mysql-based storage. We believe that this isn’t the time to make large scale migration to a new technology. We will focus our Cassandra work on new projects that we wouldn’t be able to ship without a large-scale data store.

Why did Twitter stop using Cassandra?

Let's analyze why Twitter stopped using Cassandra.
1. Cassandra still lacks real-world cases and experience with highly concurrent access to massive data. Cassandra originated at Facebook, but inside Facebook it is currently used only for the inbox search product, with roughly 100-200 TB of data, and Inbox Search is not a core application in Facebook's infrastructure. There have also been rumors that Facebook has abandoned Cassandra.

2. A new product needs time to stabilize, and the Cassandra code base probably still has quite a few problems. For Twitter, investing heavily in improving Cassandra does not look worthwhile compared with the effort of optimizing MySQL. At QCon Beijing, @nk also mentioned that Cassandra had exposed a number of serious problems in Twitter's internal testing.

Why did Twitter choose Cassandra in the first place?

This question was covered at QCon Beijing 2010 and also discussed at the first Guangzhou tech salon last year. The main reasons a site like Twitter would use Cassandra are:
1. Data growth keeps requiring new servers, and traditional sharding schemes need manual maintenance whenever hardware is added or removed. As data grows faster and the business cannot tolerate downtime for maintenance, the cost of manual maintenance overwhelms operations.
2. Growth in request volume cannot be handled simply by adding servers; it needs careful planning by a data architect.
3. Every new feature requires re-evaluating data sharding and access optimization, so architects spend a lot of effort reviewing nearly identical business scenarios.

Twitter's adjustment is probably good news for the MySQL community. Although MySQL is under the shadow of the recent Oracle acquisition, it is still the first choice for most websites with massive data access today. MySQL is simple, reliable, and secure, with complete supporting tools and mature operations practice. Most of the scalability problems the industry runs into actually have clear, well-understood solutions in MySQL. Repeated re-sharding is annoying and the operational work of adding and removing machines is tedious, but the workload is still within an acceptable range.

Is this change of strategy at Twitter a setback for the NoSQL movement, or just a small episode on its way forward? We shall see. For now, another big Web 2.0 player, Digg, is still using Cassandra.

JVM Tuning Summary

    1. Heap size settings
      The maximum JVM heap size is limited by three things: the data model of the operating system (32-bit or 64-bit), the available virtual memory of the system, and the available physical memory of the system. On 32-bit systems it is generally limited to about 1.5G~2G; 64-bit operating systems impose essentially no limit. In my test on Windows Server 2003 with 3.5G of physical memory and JDK 5.0, the largest value I could set was 1478m.
      Typical settings:
    • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k
      -Xmx3550m: sets the maximum memory available to the JVM to 3550m.
      -Xms3550m: sets the initial JVM heap to 3550m. This can be set equal to -Xmx so the JVM does not have to re-size the heap after every garbage collection.
      -Xmn2g: sets the young generation size to 2g. Total heap size = young generation + old generation + permanent generation. The permanent generation is usually fixed at 64m, so enlarging the young generation shrinks the old generation. This value has a big impact on performance; Sun's official recommendation is 3/8 of the whole heap.
      -Xss128k: sets the stack size of each thread. Since JDK 5.0 the default thread stack size is 1M (before that it was 256K); adjust it to the memory the application's threads actually need. With the same amount of physical memory, a smaller value allows more threads, but the operating system still caps the number of threads per process, so it cannot grow without bound; an empirical ceiling is around 3000~5000 threads.
    • java -Xmx3550m -Xms3550m -Xss128k -XX:NewRatio=4 -XX:SurvivorRatio=4 -XX:MaxPermSize=16m -XX:MaxTenuringThreshold=0
      -XX:NewRatio=4: sets the ratio of the young generation (Eden plus the two Survivor spaces) to the old generation (excluding the permanent generation). With 4, young:old is 1:4, so the young generation takes 1/5 of the heap.
      -XX:SurvivorRatio=4: sets the size ratio of Eden to a Survivor space in the young generation. With 4, the ratio of the two Survivor spaces to one Eden space is 2:4, so one Survivor space takes 1/6 of the young generation.
      -XX:MaxPermSize=16m: sets the permanent generation size to 16m.
      -XX:MaxTenuringThreshold=0: sets the maximum tenuring age of objects. With 0, young-generation objects skip the Survivor spaces and go straight into the old generation, which can improve efficiency for applications with a lot of old-generation data. A larger value makes young objects get copied between the Survivor spaces several times, which lengthens the time objects survive in the young generation and increases the probability that they are collected there.
    2. Choosing a collector
      The JVM offers three choices: the serial collector, the parallel collector, and the concurrent collector. The serial collector is only suitable for small amounts of data, so the choice here is mainly between the parallel and the concurrent collector. Before JDK 5.0 the serial collector was the default; to use another one you had to add the corresponding option at startup. From JDK 5.0 on, the JVM picks a collector based on the current system configuration.
    • Throughput-first parallel collector
      As noted above, the parallel collector mainly aims at reaching a given throughput, and is suited to scientific computing, batch jobs, and other background processing.
      Typical configurations:

      • java -Xmx3800m -Xms3800m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20
        -XX:+UseParallelGC: selects the parallel garbage collector. This option only affects the young generation; with the configuration above, the young generation is collected in parallel while the old generation is still collected serially.
        -XX:ParallelGCThreads=20: sets the number of parallel collector threads, i.e. how many threads collect garbage at the same time. This is best set equal to the number of processors.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:ParallelGCThreads=20 -XX:+UseParallelOldGC
        -XX:+UseParallelOldGC: collects the old generation in parallel as well. JDK 6.0 supports parallel collection of the old generation.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:MaxGCPauseMillis=100
        -XX:MaxGCPauseMillis=100: sets the maximum pause time for each young-generation collection. If this target cannot be met, the JVM automatically shrinks the young generation to satisfy it.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelGC -XX:MaxGCPauseMillis=100 -XX:+UseAdaptiveSizePolicy
        -XX:+UseAdaptiveSizePolicy: with this option the parallel collector automatically chooses the young-generation size and the Survivor ratio to meet the pause-time target or collection frequency expected of the system. It is recommended to keep it on whenever the parallel collector is used.
    • Response-time-first concurrent collector
      As noted above, the concurrent collector mainly tries to guarantee response time by shortening the pauses caused by garbage collection. It is suited to application servers, telecom systems, and the like.
      Typical configurations:

      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC: collects the old generation concurrently (CMS). In my tests, setting this caused -XX:NewRatio=4 to stop taking effect, for reasons unknown, so in that case it is better to size the young generation with -Xmn.
        -XX:+UseParNewGC: collects the young generation in parallel. It can be used together with the CMS collector. From JDK 5.0 on, the JVM sets this by itself based on the system configuration, so there is no need to set it explicitly.
      • java -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection
        -XX:CMSFullGCsBeforeCompaction: because the concurrent collector neither compacts nor defragments memory, fragmentation builds up over time and efficiency drops. This option sets after how many GC runs the memory is compacted and defragmented.
        -XX:+UseCMSCompactAtFullCollection: enables compaction of the old generation. It may cost some performance, but it eliminates fragmentation.
    3. Diagnostic output
      The JVM provides a large number of command-line options that print information useful for debugging and tuning. The main ones are:
    • -XX:+PrintGC
      Sample output: [GC 118250K->113543K(130112K), 0.0094143 secs] [Full GC 121376K->10414K(130112K), 0.0650971 secs]
    • -XX:+PrintGCDetails
      Sample output: [GC [DefNew: 8614K->781K(9088K), 0.0123035 secs] 118250K->113543K(130112K), 0.0124633 secs] [GC [DefNew: 8614K->8614K(9088K), 0.0000665 secs][Tenured: 112761K->10414K(121024K), 0.0433488 secs] 121376K->10414K(130112K), 0.0436268 secs]
    • -XX:+PrintGCTimeStamps -XX:+PrintGC: PrintGCTimeStamps can be combined with either of the two options above.
      Sample output: 11.851: [GC 98328K->93620K(130112K), 0.0082960 secs]
    • -XX:+PrintGCApplicationConcurrentTime: prints how long the application ran uninterrupted before each collection. Can be combined with the options above.
      Sample output: Application time: 0.5291524 seconds
    • -XX:+PrintGCApplicationStoppedTime: prints how long the application was paused during garbage collection. Can be combined with the options above.
      Sample output: Total time for which application threads were stopped: 0.0468229 seconds
    • -XX:+PrintHeapAtGC: prints detailed heap information before and after each GC.
      Sample output:
      34.702: [GC {Heap before gc invocations=7:
      def new generation   total 55296K, used 52568K [0x1ebd0000, 0x227d0000, 0x227d0000)
      eden space 49152K,  99% used [0x1ebd0000, 0x21bce430, 0x21bd0000)
      from space 6144K,  55% used [0x221d0000, 0x22527e10, 0x227d0000)
      to   space 6144K,   0% used [0x21bd0000, 0x21bd0000, 0x221d0000)
      tenured generation   total 69632K, used 2696K [0x227d0000, 0x26bd0000, 0x26bd0000)
      the space 69632K,   3% used [0x227d0000, 0x22a720f8, 0x22a72200, 0x26bd0000)
      compacting perm gen  total 8192K, used 2898K [0x26bd0000, 0x273d0000, 0x2abd0000)
      the space 8192K,  35% used [0x26bd0000, 0x26ea4ba8, 0x26ea4c00, 0x273d0000)
      ro space 8192K,  66% used [0x2abd0000, 0x2b12bcc0, 0x2b12be00, 0x2b3d0000)
      rw space 12288K,  46% used [0x2b3d0000, 0x2b972060, 0x2b972200, 0x2bfd0000)
      34.735: [DefNew: 52568K->3433K(55296K), 0.0072126 secs] 55264K->6615K(124928K)Heap after gc invocations=8:
      def new generation   total 55296K, used 3433K [0x1ebd0000, 0x227d0000, 0x227d0000)
      eden space 49152K,   0% used [0x1ebd0000, 0x1ebd0000, 0x21bd0000)
      from space 6144K,  55% used [0x21bd0000, 0x21f2a5e8, 0x221d0000)
      to   space 6144K,   0% used [0x221d0000, 0x221d0000, 0x227d0000)
      tenured generation   total 69632K, used 3182K [0x227d0000, 0x26bd0000, 0x26bd0000)
      the space 69632K,   4% used [0x227d0000, 0x22aeb958, 0x22aeba00, 0x26bd0000)
      compacting perm gen  total 8192K, used 2898K [0x26bd0000, 0x273d0000, 0x2abd0000)
      the space 8192K,  35% used [0x26bd0000, 0x26ea4ba8, 0x26ea4c00, 0x273d0000)
      ro space 8192K,  66% used [0x2abd0000, 0x2b12bcc0, 0x2b12be00, 0x2b3d0000)
      rw space 12288K,  46% used [0x2b3d0000, 0x2b972060, 0x2b972200, 0x2bfd0000)
      }
      , 0.0757599 secs]
    • -Xloggc:filename: used together with the options above to write the GC log to a file for later analysis.
    4. Summary of common settings
      (A short Java sketch after this summary shows how to verify the effective settings at runtime.)
    • Heap settings
      • -Xms: initial heap size
      • -Xmx: maximum heap size
      • -XX:NewSize=n: sets the young generation size
      • -XX:NewRatio=n: sets the ratio of the young generation to the old generation. For example, 3 means young:old = 1:3, so the young generation takes 1/4 of the combined young and old generations.
      • -XX:SurvivorRatio=n: ratio of Eden to the two Survivor spaces in the young generation. Note that there are two Survivor spaces. For example, 3 means Eden:Survivor = 3:2, so one Survivor space takes 1/5 of the young generation.
      • -XX:MaxPermSize=n: sets the permanent generation size
    • Collector settings
      • -XX:+UseSerialGC: use the serial collector
      • -XX:+UseParallelGC: use the parallel collector
      • -XX:+UseParallelOldGC: use the parallel old-generation collector
      • -XX:+UseConcMarkSweepGC: use the concurrent (CMS) collector
    • Garbage collection statistics
      • -XX:+PrintGC
      • -XX:+PrintGCDetails
      • -XX:+PrintGCTimeStamps
      • -Xloggc:filename
    • Parallel collector settings
      • -XX:ParallelGCThreads=n: number of CPUs used by the parallel collector when collecting, i.e. the number of parallel collection threads.
      • -XX:MaxGCPauseMillis=n: maximum pause time for parallel collection
      • -XX:GCTimeRatio=n: sets the ratio of garbage collection time to application run time; the formula is 1/(1+n)
    • Concurrent collector settings
      • -XX:+CMSIncrementalMode: enables incremental mode; suitable for single-CPU machines.
      • -XX:ParallelGCThreads=n: number of CPUs used when the concurrent collector's young-generation collection runs in parallel, i.e. the number of parallel collection threads.
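
Since these options are easy to mistype or to have overridden by a launcher script, a quick sanity check is to print the effective values from inside the JVM. The following small Java sketch (illustrative only, using standard JDK APIs) prints the effective heap limits, the active collectors, and the raw JVM arguments:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JvmSettingsCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // -Xmx roughly corresponds to maxMemory(), -Xms to the initial totalMemory()
        System.out.println("Max heap   (-Xmx): " + rt.maxMemory() / (1024 * 1024) + " MB");
        System.out.println("Total heap (-Xms): " + rt.totalMemory() / (1024 * 1024) + " MB");

        // Shows which collectors are actually active (e.g. PS Scavenge / PS MarkSweep,
        // ParNew / ConcurrentMarkSweep), confirming whether the -XX:+Use...GC flags took effect
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Garbage collector: " + gc.getName());
        }

        // The raw JVM arguments as seen by the runtime
        System.out.println("JVM args: " + ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}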

    Tomcat Vs Resin

    Two mainstream application servers; let's benchmark their performance :)

    Resin 3.0.9 vs Tomcat 5.0, JDK 5.0, with AppFuse as the deployed application

    Resin startup time: 25 s; Tomcat startup time: 14 s

    Running “ant test-canoo” to execute Canoo’s WebTest (server already started), with previously compiled JSP files removed:

    • Resin: first run (JSPs not yet compiled) – 53 s, second run (JSPs compiled) – 24 s.
    • Tomcat: first run (JSPs not yet compiled) – 49 s, second run (JSPs compiled) – 14 s.

    Just a simple test, but still useful as a reference.