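I. Common Downtime Symptoms

1. Accessing the server reports "This site can't be reached"

The Network panel of the browser developer tools (F12) shows ERR_CONNECTION_TIMED_OUT.

2. Accessing the server keeps loading and the page stays blank

The Network panel of the browser developer tools (F12) shows the request stuck in the pending state.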

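II. Tools and Logs Used for Downtime Analysis

Servers are usually deployed on Linux, so the following uses Linux and Tomcat as the example:

  1. Confirm that the service port is reachable, using the curl command.

  2. On the target server, check whether the java process exists, using the ps or jps command.

  3. Check the logs: the operating system (Linux) log /var/log/messages, the log generated automatically when the java process crashes (hs_err_pidxxx.log), the Tomcat log catalina.out, and the Smartbi log smartbi.log. Memory problems usually show up in these logs.

  4. Examine the heap snapshot file written on memory overflow (xxxx.hprof), or a heap snapshot generated with jmap, to identify which operation makes the system use too much memory and to decide whether to optimize the system or increase the configuration.

  5. Use a thread snapshot generated with jstack to check for problems such as deadlocks or an exhausted database connection pool.

  6. Use top -H -p <pid> to see which thread uses the most memory and CPU. If the garbage collection threads use the most CPU, memory may be insufficient.

For details, see Troubleshooting Path > Information Collection. Once this information has been collected, the problem can be reported for analysis.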

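III. Troubleshooting Path

(I) Information Collection

1. Confirm whether the server can be accessed

Step 1: Run curl http://localhost:<port>/smartbi/vision/index.jsp

Step 2: Confirm whether the server returns content. (Note: if the server is configured with an extension package such as CAS or SSO that redirects to another server, test with a different page that does not redirect.)

Step 3: If the server responds normally, check whether the proxy is working properly.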
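The reachability check above can be scripted. The following is a minimal sketch, not an official Smartbi tool: the function names classify_http_code and probe_url are hypothetical, and the 10-second timeout is an assumed value. It relies on curl printing 000 for %{http_code} when no HTTP response was received.

```shell
# classify_http_code CODE
# Map the status code printed by curl to a coarse verdict.
# curl prints 000 when it received no HTTP response at all
# (connection refused, timeout, DNS failure).
classify_http_code() {
    case "$1" in
        000) echo "unreachable" ;;
        2??) echo "ok" ;;
        3??) echo "redirect" ;;   # e.g. a CAS/SSO extension redirecting elsewhere
        *)   echo "error" ;;
    esac
}

# probe_url URL
# Request URL with a 10 s limit (assumed value), discard the body,
# and print the verdict for the returned status code.
probe_url() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
    classify_http_code "$code"
}
```

Call it as probe_url "http://localhost:<port>/smartbi/vision/index.jsp" with the actual port filled in. A result of ok on the server itself combined with a failure from the client side points at the proxy rather than at Smartbi.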

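2. Confirm whether the process exists

Use the ps or jps command to check whether the java process exists.

If the process does not exist, check whether hs_err_pidxxx.log was generated in the server startup directory and whether /var/log/messages contains an Out of memory: Kill process entry.

If the process exists, proceed to the next step.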
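As a rough illustration of this step, the sketch below lists running java processes and, when none is found, gathers the two pieces of evidence mentioned above. The function names java_pids and find_crash_evidence are hypothetical, and the startup directory has to be supplied by the caller because it depends on the deployment.

```shell
# java_pids
# Print the PIDs of processes whose name is exactly "java".
java_pids() {
    pgrep -x java
}

# find_crash_evidence START_DIR [SYSLOG]
# Print JVM crash logs found in START_DIR and OOM-killer lines found in
# SYSLOG (defaults to /var/log/messages).
find_crash_evidence() {
    start_dir=$1
    syslog=${2:-/var/log/messages}
    # hs_err_pid<pid>.log is written by the JVM when it crashes
    ls "$start_dir"/hs_err_pid*.log 2>/dev/null
    # Matches both the "Kill process" and the "Killed process" wording
    grep 'Out of memory: Kill' "$syslog" 2>/dev/null
    return 0
}
```

If java_pids prints nothing, run find_crash_evidence against the directory the server was started from. Reading /var/log/messages usually requires root privileges.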

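3. Confirm whether there is a memory overflow or insufficient memory

Method 1: Check whether an xxxx.hprof file was generated in the server startup directory.

(Note: this requires the JVM option -XX:+HeapDumpOnOutOfMemoryError to be configured.)

Also check whether log files such as tomcat/logs/catalina.out or tomcat/bin/smartbi.log contain java.lang.OutOfMemoryError messages.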
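The log check can be reduced to one search. This is only a sketch: find_oom_errors is a hypothetical helper name, and it assumes a grep that supports the -o option (GNU and BSD grep both do).

```shell
# find_oom_errors LOG_FILE...
# Print each distinct java.lang.OutOfMemoryError message found in the
# given log files, prefixed with the number of times it occurs.
find_oom_errors() {
    grep -ho 'java\.lang\.OutOfMemoryError: [A-Za-z ]*' "$@" | sort | uniq -c
}
```

For example, find_oom_errors tomcat/logs/catalina.out tomcat/bin/smartbi.log. Grouping by message matters because a Metaspace overflow is handled differently from the other kinds of memory overflow.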


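Method 2: Inspect the threads with top -H -p <pid>

Check whether the threads with high CPU usage are GC threads. (Note: top shows thread IDs in decimal, whereas the nid in a jstack thread dump is hexadecimal.)

If a thread's %CPU value is very high (close to 100% or more), that thread is using a lot of CPU.

The top command displays a dynamically updated table containing detailed information about each thread: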

  PID USER  PR NI  VIRT  RES   SHR S %CPU %MEM    TIME+ COMMAND
21978 root  20  0 16.3g 4.2g 18736 S 99.7  6.7 13:23.19 java
20562 root  20  0 16.3g 4.2g 18736 S 99.0  6.7  0:00.00 java
20563 root  20  0 16.3g 4.2g 18736 S 99.0  6.7  0:00.37 java

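Print the threads with the jstack command (see the thread-printing tutorial).

The thread dump contains several GC threads. Take one of them as an example:

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x0000000002eb7000 nid=0x5052 runnable

Decimal 20562 is 5052 in hexadecimal, so in this example the thread using the CPU is a GC thread.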
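The decimal-to-hexadecimal lookup can be done on the command line instead of by hand. A small sketch follows; the helper names tid_to_nid and find_thread_by_tid are hypothetical, and the thread dump is assumed to have been saved to a file with jstack beforehand.

```shell
# tid_to_nid TID
# Convert a decimal thread ID shown by top into the hexadecimal nid
# format used in jstack output.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# find_thread_by_tid TID DUMP_FILE
# Print the header line of the thread in DUMP_FILE whose nid matches TID.
# The trailing space keeps 0x5052 from also matching 0x50521.
find_thread_by_tid() {
    grep "nid=$(tid_to_nid "$1") " "$2"
}
```

With the values from the example, tid_to_nid 20562 prints 0x5052.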


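If a memory overflow or insufficient memory is found and the server meets the configuration requirements (see the server configuration requirements), then for every kind of memory overflow except java.lang.OutOfMemoryError: Metaspace, generate a heap snapshot (class histogram) by running: jmap -histo:live <pid> > <pid>.map

If conditions permit, also generate a dump file of the whole heap by running: jmap -dump:live,format=b,file=<pid>.bin <pid>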
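The two jmap commands can be wrapped so that both outputs land in one directory. This is a sketch under assumptions: collect_heap_info is a hypothetical name, jmap must be on the PATH, and it should be run as the same user that owns the java process.

```shell
# collect_heap_info PID [OUT_DIR]
# Write a class histogram (PID.map) and then a full heap dump (PID.bin)
# for the given java process into OUT_DIR (defaults to the current directory).
# Note: the :live option forces a full GC first, so expect a pause on a
# busy server, and the dump file can be as large as the heap itself.
collect_heap_info() {
    pid=$1
    out_dir=${2:-.}
    jmap -histo:live "$pid" > "$out_dir/$pid.map"
    jmap -dump:live,format=b,file="$out_dir/$pid.bin" "$pid"
}
```

The histogram is taken first because it is small and quick; the full dump is the step to skip when disk space or downtime tolerance is limited.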


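For a Metaspace overflow, provided that -XX:MaxMetaspaceSize is already configured above 1 GB, class loading information must be collected for analysis.

(Note: this requires the following JVM options to be configured: -XX:+UnlockDiagnosticVMOptions -XX:-DisplayVMOutput -XX:+LogVMOutput -XX:+TraceClassLoadingPreorder -XX:+TraceClassLoading -XX:+TraceClassUnloading -XX:LogFile=D:/logs/TomcatMetaspaceOOM/class_load.log)

If the problem is not a memory overflow or insufficient memory, proceed to the next step.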

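4. Collect thread information to determine whether there is a deadlock or an exhausted connection pool or thread pool

Generate the thread information by running: jstack <pid> > <pid>.map

1) Check the threads for a deadlock

Search for the keyword:

Found one Java-level deadlock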
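The keyword search can be expressed as a one-line check. In this sketch has_deadlock is a hypothetical helper name, and the argument is the file written by the jstack command above.

```shell
# has_deadlock DUMP_FILE
# Succeed (exit status 0) if the jstack output reports a Java-level deadlock.
has_deadlock() {
    grep -q 'Found one Java-level deadlock' "$1"
}
```

It is meant for use in a condition, for example: if has_deadlock 1234.map; then echo "deadlock found"; fi (1234 stands for the actual process ID).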

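2) Check the threads for an exhausted connection pool

Search for the keywords:

    at java.lang.Object.wait(Object.java:502)
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:810)

If the connection pool is exhausted, take the newest file in the connpool-smartbi directory under the server startup directory, or open http://<server:port>/smartbi/vision/monitor/connectionpoolinfo.jsp, to obtain the code stacks that borrowed a database connection without closing it, and analyze them.

(Note: V11 requires ENABLE_CONNECTION_POOL_STACK_TRACE=true to be set in the system advanced options to enable this tracking.)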
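To get a quick sense of how many threads are stuck borrowing a connection, the frames can be counted. This is an approximation rather than an exact diagnosis: count_pool_waiters is a hypothetical name, and it counts every thread that has a borrowObject frame, including any that are not currently waiting.

```shell
# count_pool_waiters DUMP_FILE
# Count the stack frames inside GenericObjectPool.borrowObject, which
# approximates the number of threads blocked while borrowing a
# connection from the pool. -F treats the dots literally.
count_pool_waiters() {
    grep -c -F 'org.apache.commons.pool.impl.GenericObjectPool.borrowObject' "$1"
}
```

A count of a few threads can be normal under load; a count that keeps growing between two dumps taken a minute apart is the stronger signal.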

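3) Check the threads for an exhausted thread pool

Count the threads that are executing requests and check whether the count equals the number of request-processing threads of the middleware (200 by default for Tomcat). The request threads can be found by searching for "http"

(for example, the keyword "http-nio-8080-exec" is found 200 times).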
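Counting the request threads by hand is error-prone, so a sketch of the count is given here. The name count_request_threads is hypothetical, and the default prefix http-nio-8080-exec assumes the NIO connector on port 8080; pass a different prefix for another port or connector.

```shell
# count_request_threads DUMP_FILE [PREFIX]
# Count the thread header lines in a jstack dump whose thread name
# starts with PREFIX (defaults to http-nio-8080-exec).
count_request_threads() {
    grep -c "^\"${2:-http-nio-8080-exec}-" "$1"
}
```

Compare the printed number with the connector's configured maximum (200 unless it was changed). Only thread header lines are counted, so stack frames that happen to contain the word http do not inflate the result.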

If the following situation appears, proceed to the next step:

"http-nio-8080-exec-7" #131 daemon prio=5 os_prio=0 tid=0x0000000034b35000 nid=0x92c4 waiting on condition [0x00000000476ae000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006cb829390> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:146)
    at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:33)


"http-nio-18080-exec-47" #4284 daemon prio=5 os_prio=0 tid=0x00007f473c01d800 nid=0x10fa runnable [0x00007f49c6ad0000]
java.lang.Thread.State: RUNNABLE
at sun.tools.attach.LinuxVirtualMachine.read(Native Method)
at sun.tools.attach.LinuxVirtualMachine$SocketInputStream.read(LinuxVirtualMachine.java:235)
- locked <0x000000078d1b0d68> (a sun.tools.attach.LinuxVirtualMachine$SocketInputStream)
at sun.tools.attach.HotSpotVirtualMachine.readInt(HotSpotVirtualMachine.java:214)
at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:175)
at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:195)
at sun.tools.attach.HotSpotVirtualMachine.remoteDataDump(HotSpotVirtualMachine.java:156)
at sun.reflect.GeneratedMethodAccessor3943.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at smartbi.freequery.migrate.SmartbiVirtualMachine.remoteDataDump(SmartbiVirtualMachine.java:177)
at smartbi.freequery.migrate.ExportLog.exportThreadDumpByAttach(ExportLog.java:933)
- locked <0x00000003e8cbe9b8> (a java.lang.Class for smartbi.freequery.migrate.ExportLog)
at smartbi.freequery.migrate.ExportLog.exportThreadDumpInfo(ExportLog.java:716)
at smartbi.management.LocalManagementHandler.getCurrentThreadDumpInfo(LocalManagementHandler.java:526)
at smartbi.management.LocalManagementHandler.getThreadDumpViewInitInfo(LocalManagementHandler.java:673)
at smartbi.management.ManagementService.getThreadDumpViewInitInfo(ManagementService.java:355)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at smartbi.framework.rmi.ClientService.executeInternal(ClientService.java:188)
at smartbi.framework.rmi.ClientService.execute(ClientService.java:166)
at smartbi.framework.rmi.RMIServlet.processExecute(RMIServlet.java:233)
at smartbi.framework.rmi.RMIServlet.doPost(RMIServlet.java:144)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:682)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:765)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.extension.ExtensionFilter$1.doFilter(ExtensionFilter.java:249)
at smartbi.extension.ExtensionFilter$2.doFilter(ExtensionFilter.java:276)
at smartbi.aichat.AIChatFilter.doFilter(AIChatFilter.java:56)
at smartbi.extension.ExtensionFilter$2.doFilter(ExtensionFilter.java:276)
at smartbi.security.patch.PatchFilter.doFilter(PatchFilter.java:77)
at smartbi.extension.ExtensionFilter$2.doFilter(ExtensionFilter.java:276)
at smartbi.imitator.NetworkInterceptorFilter.doFilter(NetworkInterceptorFilter.java:88)
at smartbi.extension.ExtensionFilter$2.doFilter(ExtensionFilter.java:276)
at smartbi.extension.ExtensionFilter.doFilterInternal(ExtensionFilter.java:279)
at smartbi.extension.ExtensionFilter.doFilter(ExtensionFilter.java:127)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.GZIPFilter.doFilter(GZIPFilter.java:265)
at smartbi.freequery.filter.Filter.doFilter(Filter.java:33)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.ExceptionResponseFilter.doFilter(ExceptionResponseFilter.java:92)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.framework.rmi.TransactionFilter.doFilter(TransactionFilter.java:87)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.CheckIsLoggedFilter.doFilter(CheckIsLoggedFilter.java:198)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.CheckRefererFilter.doFilter(CheckRefererFilter.java:45)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.CheckHttpMethodFilter.doFilter(CheckHttpMethodFilter.java:62)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.TraceFilter.doFilter(TraceFilter.java:145)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.LogFilter.doFilter(LogFilter.java:112)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.framework.RedisSessionFilter.doFilter(RedisSessionFilter.java:58)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.freequery.filter.DisableUrlSessionFilter.doFilter(DisableUrlSessionFilter.java:81)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at smartbi.framework.SmartbiApplicationFilter.doFilter(SmartbiApplicationFilter.java:71)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:177)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:543)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:698)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:367)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:639)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:885)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1688)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x000000078ccd8dc0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)

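IV. Reporting the Problem

Once the information has been collected with the steps above, the problem can be reported for analysis and handling. The collected information can also be used for initial handling by following the common downtime causes and solutions below. In general, if the logs show that the process disappeared because of memory, adjust the deployment configuration first; cases that need thread snapshot or heap snapshot analysis to confirm the cause must be reported for analysis.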

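V. Information Analysis and Handling

1. Analyzing and handling a system OOM Killer event or a JVM crash

1) If the JVM crash information does not contain The system is out of physical RAM or swap space, a JDK bug may have been triggered, and the JDK needs to be updated.

2) If the JVM was unable to obtain memory from the operating system, confirm that no other memory-hungry programs (such as the high-speed cache database or cross-database federated data sources) are running on the operating system. If there are, limit the maximum memory those programs may use; if they cannot be limited, consider deploying them separately.

3) If there are no other memory-hungry programs, check whether the Xmx setting exceeds the free memory of the operating system (note: leaving 5 to 10 GB for the operating system is recommended). If it does, reduce the Xmx setting (note: the reduced value must still meet the server configuration requirements; if it cannot, the server must be scaled up with more memory).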
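The comparison in item 3) can be sketched as follows. The names mem_available_mb and xmx_fits are hypothetical, the default reserve of 5120 MB is the lower end of the 5 to 10 GB recommendation, and reading MemAvailable from /proc/meminfo is an assumption about how to measure free memory on Linux. A reading taken while the Smartbi JVM is running already excludes the memory that JVM holds, so it is most meaningful with the JVM stopped.

```shell
# mem_available_mb [MEMINFO_FILE]
# Print the MemAvailable value in MB (the file reports it in kB).
mem_available_mb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# xmx_fits XMX_MB AVAILABLE_MB [RESERVE_MB]
# Succeed if XMX_MB still leaves RESERVE_MB (default 5120) for the
# operating system out of AVAILABLE_MB.
xmx_fits() {
    [ "$1" -le $(( $2 - ${3:-5120} )) ]
}
```

For example, xmx_fits 8192 "$(mem_available_mb)" succeeds only if an 8 GB heap leaves at least 5 GB of the currently available memory untouched.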

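2. Analyzing and handling a JVM OOM

Open the generated xxx.hprof with MAT (https://eclipse.dev/mat/download/) or JVisualVM (bundled with the JDK) and analyze it by following the brief heap dump analysis steps.

3. Analyzing and handling deadlocks and exhausted connection pools or thread pools

When handling deadlocks or exhausted connection pools or thread pools, developers should note the following:

1) Handling deadlocks

Check whether the synchronization locks in the code are necessary, and avoid synchronized locks where possible. If synchronized must be used, always lock the objects in the same order.

2) Handling an exhausted connection pool

Review the code stacks that borrowed a database connection without closing it, and check whether the corresponding code borrows a database connection from the connection pool. If it does, the connection must be closed in a finally block, so that an exception cannot cause the close logic to be skipped.

If the unclosed connection in the code stack was borrowed from the knowledge base, check whether the corresponding code passes through smartbi.framework.rmi.TransactionFilter and, if it runs in a separate thread, whether smartbi.framework.rmi.RMIModule.doEndRequest(Object) is called when the thread ends.

3) Handling an exhausted thread pool

Review the code stacks that are processing requests, and check whether the corresponding code contains an infinite loop or code with exponential time complexity.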

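VI. Common Downtime Causes and Solutions

1. The JVM requests more memory than the server has available

Cause: the memory the JVM requests from the operating system exceeds the memory the operating system has available, so the JVM cannot obtain enough memory and crashes.

Solution: this is usually a configuration problem. Typically, set the -Xmx maximum heap size to about 75% of the operating system memory; or, if several memory-hungry applications are deployed on the same machine, change the deployment structure according to the concurrency.

Diagnosis: an hs_err_pidxxx.log file is usually generated in the server startup directory. This is the log the java process writes automatically when it crashes, and it usually contains the following: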

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1263696 bytes for Chunk::new
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (allocation.cpp:390), pid=30576, tid=0x00000000000094f8
#
# JRE version: Java(TM) SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.172-b11 mixed mode windows-amd64 compressed oops)
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
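The 75% guideline from the solution above can be turned into a concrete flag. This is a sketch only: recommended_xmx_flag is a hypothetical name, it takes MemTotal from /proc/meminfo as "the operating system memory", and the result is a starting point that still has to respect the other applications on the machine.

```shell
# recommended_xmx_flag [MEMINFO_FILE]
# Print an -Xmx flag set to 75% of MemTotal, in MB
# (the file reports MemTotal in kB).
recommended_xmx_flag() {
    awk '/^MemTotal:/ { printf "-Xmx%dm\n", $2 / 1024 * 0.75 }' "${1:-/proc/meminfo}"
}
```

On a machine with 32 GB of memory this yields -Xmx24576m.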

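2. The JVM requests too much memory and triggers the operating system OOM Killer

Cause: at some moment the application requests a large amount of memory and the system runs short. This usually triggers the Out of Memory (OOM) killer in the Linux kernel, which kills a process to free memory for the system so that the system does not crash immediately.

Solution: this is usually a configuration problem. Typically, set the -Xmx maximum heap size to about 75% of the operating system memory; or, if several memory-hungry applications are deployed on the same machine, change the deployment structure according to the concurrency.

Diagnosis: the relevant log file (/var/log/messages, the Linux system log) contains Out of memory: Kill process entries similar to the following: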

...
Out of memory: Kill process 9682 (mysqld) score 9 or sacrifice child
Killed process 9682, UID 27, (java) total-vm:47388kB, anon-rss:3744kB, file-rss:80kB
httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
httpd cpuset=/ mems_allowed=0
Pid: 8911, comm: httpd Not tainted 2.6.32-279.1.1.el6.i686 #1
...
21556 total pagecache pages
21049 pages in swap cache
Swap cache stats: add 12819103, delete 12798054, find 3188096/4634617

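3. Memory overflow in the middleware server

Cause: while running, the Java program requests more memory than the maximum configured for the JVM, which makes key threads of the middleware server exit and leaves the system unable to respond.

Solution: usually the memory configured for the JVM cannot meet the needs of the system. For example, opening a very large spreadsheet report can request more memory at once than is available. If further analysis of the heap snapshot confirms that the system has no room for optimization, and the scenario is confirmed to occur on site, increase the system memory and the JVM memory setting (-Xmx).

Diagnosis: log files such as smartbi.log and catalina.out usually contain messages like the following: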

02-21 09:02:25 ERROR validateConnectionInternal(smartbi.connectionpool.SmartbiPoolableConnectionFactory:379) [tid=f667fffff116dab7] - java.lang.OutOfMemoryError: Metaspace
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Metaspace
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_222]
    at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_222]
    at smartbi.connectionpool.SmartbiPoolableConnectionFactory.validateConnectionInternal(SmartbiPoolableConnectionFactory.java:377) ~[Smartbi-SmartbiCommon.jar:?]

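4. Deadlock in the code

Cause: two or more threads in the Java program block indefinitely because each waits for the other to release a resource; they cannot continue, and the system becomes unable to respond.

Solution: this is usually a code problem. If the deadlock is in Smartbi, replace the Smartbi package; if it is in an application server such as Tomcat, upgrade that application server.

Diagnosis: the thread file generated by jstack usually contains the message Found one Java-level deadlock.

5. Database connection pool exhausted

Cause: after Smartbi code obtains a database connection, the database does not return for a long time or the connection is not closed promptly, so other requests cannot obtain a database connection and the system becomes unable to respond.

Solution: if the pool is not exhausted because of excessive concurrency or an unresponsive database, the cause is usually a bug that leaves connections unclosed in some situations, and the package must be replaced.

Diagnosis: the thread file generated by jstack usually contains java.lang.Object.wait and org.apache.commons.pool.impl.GenericObjectPool.borrowObject, for example: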

"Smartbi-Pool-4368" #32491 daemon prio=5 os_prio=0 tid=0x00007f93040a9000 nid=0x7860 in Object.wait() [0x00007f92c956d000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:810)
    - locked <0x00000003f85ddb30> (a smartbi.connectionpool.ConnectionPool$6)
    at smartbi.connectionpool.ConnectionPool$6.borrowObject(ConnectionPool.java:1463)
    at org.apache.commons.dbcp.PoolingDriver.connect(PoolingDriver.java:180)
    at smartbi.connectionpool.ConnectionPool.lambda$doGetConnection$0(ConnectionPool.java:823)
    at smartbi.connectionpool.ConnectionPool$$Lambda$490/113031938.get(Unknown Source)
    at smartbi.monitor.MetricHelper$.withSpan(MetricHelper.scala:87)
    at smartbi.monitor.MetricHelper.withSpan(MetricHelper.scala)
    at smartbi.connectionpool.ConnectionPool.doGetConnection(ConnectionPool.java:818)
    at smartbi.connectionpool.ConnectionPool.driverConnect(ConnectionPool.java:636)
    at smartbi.connectionpool.ConnectionPool.getConnection(ConnectionPool.java:964)
    at smartbi.connectionpool.ConnectionPool.getConnection(ConnectionPool.java:903)

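6. Middleware server thread pool exhausted

Cause: the request-processing threads of the middleware server have reached their limit and none of them is idle, so new requests cannot be accepted and the server cannot respond.

Solution: inspect the threads. If the pool is not exhausted because of excessive concurrency, individual requests probably take too long for the system to sustain the required concurrency. Increase the thread pool size, or confirm whether the long request time is reasonable.

Diagnosis: check whether any HTTP request-processing threads are idle. The first thread below is an idle request-processing thread (parked in TaskQueue.take); the second is busy handling a request: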

"http-nio-8080-exec-7" #131 daemon prio=5 os_prio=0 tid=0x0000000034b35000 nid=0x92c4 waiting on condition [0x00000000476ae000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006cb829390> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:146)
    at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:33)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1114)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1176)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)


"http-nio-8080-exec-8" #132 daemon prio=5 os_prio=0 tid=0x0000000034b36000 nid=0x9670 runnable [0x00000000477a9000]
   java.lang.Thread.State: RUNNABLE
    at sun.tools.attach.WindowsVirtualMachine.connectPipe(Native Method)
    at sun.tools.attach.WindowsVirtualMachine.execute(WindowsVirtualMachine.java:82)
    at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:195)
    at sun.tools.attach.HotSpotVirtualMachine.remoteDataDump(HotSpotVirtualMachine.java:156)
    at sun.reflect.GeneratedMethodAccessor604.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at smartbi.freequery.migrate.SmartbiVirtualMachine.remoteDataDump(SmartbiVirtualMachine.java:177)
    at smartbi.freequery.migrate.ExportLog.exportThreadDumpByAttach(ExportLog.java:933)
    - locked <0x00000006ca836a08> (a java.lang.Class for smartbi.freequery.migrate.ExportLog)
    at smartbi.freequery.migrate.ExportLog.exportThreadDumpInfo(ExportLog.java:716)
    at smartbi.management.LocalManagementHandler.getCurrentThreadDumpInfo(LocalManagementHandler.java:526)
    at smartbi.management.LocalManagementHandler.getThreadDumpViewInitInfo(LocalManagementHandler.java:673)
    at smartbi.management.ManagementService.getThreadDumpViewInitInfo(ManagementService.java:355)
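Whether any idle request threads remain can be counted from the dump. A sketch, assuming Tomcat's thread naming (names starting with http-) and treating a thread as idle when its stack contains a TaskQueue.take frame, as in the first dump above; count_idle_request_threads is a hypothetical name.

```shell
# count_idle_request_threads DUMP_FILE
# Count the request threads (name starts with "http-) whose stack
# contains a TaskQueue.take frame, i.e. threads waiting for work.
# Each thread is counted once even though the frame appears twice.
count_idle_request_threads() {
    awk '
        /^"/ { in_req = ($0 ~ /^"http-/); seen = 0 }
        in_req && !seen && /org\.apache\.tomcat\.util\.threads\.TaskQueue\.take/ { idle++; seen = 1 }
        END { print idle + 0 }
    ' "$1"
}
```

A result of 0 together with a request-thread count at the configured maximum indicates that the pool is exhausted.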

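7. Other

1) Middleware server bugs

Cause: if memory, CPU, and thread information all look normal at the time of the outage and the logs show nothing abnormal, yet the system still cannot be accessed, the cause is very likely a bug in the middleware server. For example, older Tomcat versions have a bug when configured to use NIO2: https://bz.apache.org/bugzilla/show_bug.cgi?id=66482

Solution: upgrade the application container, or change the configuration.

Diagnosis: a system-level cause is suspected precisely when no other evidence can be found.

2) Apparent hang when the system is paused

Cause: when running from a Windows command prompt, a mouse click inside the command window pauses the program and makes it appear hung.

Solution: press the space bar in the command window to resume. In general, a deployment should not keep a command window open but should run as a background service. For details, see https://wiki.smartbi.com.cn/pages/viewpage.action?smt_poid=43&pageId=89040641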