Tuesday, December 21, 2010

java.lang.OutOfMemoryError

A java.lang.OutOfMemoryError is a subclass of java.lang.VirtualMachineError, which is thrown when the Java Virtual Machine is broken or has run out of the resources it needs to continue operating. For a java.lang.OutOfMemoryError the exhausted resource is, obviously, memory: it is thrown when the Java Virtual Machine cannot allocate an object because no more memory can be made available.

Occasionally we meet an OOME in a QA or development environment. It is an Error rather than an Exception, and the JVM typically cannot continue normal operation after one. We can usually tell from the error message whether the OOME came from the Java heap or from native memory. There are a lot of blogs/articles discussing this area; the summary below distills the ones I found via Google that are simple and easy to understand.
These are the OutOfMemoryErrors a typical Java application can see:
  • java.lang.OutOfMemoryError: PermGen space 
    • The Sun JVM has a separately allocated permanent generation; JRockit does not allocate a PermGen at all
    • -XX:PermSize / -XX:MaxPermSize
  • java.lang.OutOfMemoryError: Java heap space (a minimal reproduction sketch follows this list)
    • jmap, jhat, YourKit, JProfiler (take and analyze a heap dump)
    • -Xms128m -Xmx1024m
  • java.lang.OutOfMemoryError: unable to create new native thread 
    • The Java process size has reached its limit (the OS reserves space for each thread's stack within the process address space)
    • -Xss512k (a smaller per-thread stack leaves room for more threads)
  • java.lang.OutOfMemoryError: GC overhead limit exceeded 
    • The JVM is spending too much time in GC while recovering too little memory. This error can be thrown by the parallel or concurrent collectors.
    • -XX:-UseGCOverheadLimit
    •  The parallel collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.
  • java.lang.OutOfMemoryError: requested xxxx bytes for Chunk::new. Out of swap space
    • The process could not allocate memory as the virtual memory was full
    • One solution is to reduce the amount of Java heap you have allocated, which leaves more virtual address space for the JVM's native allocations (Chunk::new is an internal native allocation).
  • java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    • The application requested an array larger than a limit predefined by the virtual machine
    • We need to check the source code to make sure no huge array is created, dynamically or statically. 
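
For the Java heap space case, here is a minimal sketch that reproduces the error: it keeps every allocation reachable so the garbage collector can free nothing. The class name and the 1 MB chunk size are just illustrative; run it with a small heap (e.g. java -Xmx16m HeapEater) to fail fast.

// HeapEater.java -- deliberately exhausts the Java heap.
import java.util.ArrayList;
import java.util.List;

public class HeapEater {
    public static void main(String[] args) {
        List<byte[]> hoard = new ArrayList<byte[]>();
        while (true) {
            // Each chunk stays reachable via the list, so GC cannot reclaim it.
            hoard.add(new byte[1024 * 1024]);
        }
    }
}

Once it dies with "java.lang.OutOfMemoryError: Java heap space", a heap dump taken with jmap can be browsed in jhat, YourKit, or JProfiler to see what is holding the memory.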
Side note: another setting you may want to check is the system ulimit. The ulimit setting controls how many file handles a single process can have open. Since the Unix philosophy is to treat nearly everything as a file, this limit bites in more places than it seems. Running out of file handles can be reported as an OutOfMemoryError, which may take a while to figure out. Most flavors of Unix have a reasonably high default ulimit these days, but just in case, check.
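
To check or raise the limit, the standard shell built-in is ulimit (the 65536 below is just an example value):

$ ulimit -n            # show the per-process open file limit
$ ulimit -n 65536      # raise it for the current shell, subject to the hard limit

On most Linux distributions, persistent changes go in /etc/security/limits.conf.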

Thursday, December 16, 2010

Filesystem read-only caused ORA-01034: ORACLE not available

Introduction:
Usually ORA-01034: ORACLE not available means either that the db instance is not up, or that ORACLE_HOME or ORACLE_SID is not set correctly in the environment from which you are trying to connect to the db instance.
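
A quick sanity check for both causes, assuming a standard Unix shell (each running instance has a pmon background process named ora_pmon_<SID>):

$ echo $ORACLE_HOME
$ echo $ORACLE_SID
$ ps -ef | grep pmon    # one pmon process per running instance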

For details, see http://www.freelists.org/post/oracle-l/fixing-a-bad-oracle-install,1 
Oracle uses a proprietary algorithm that combines the ORACLE_HOME and ORACLE_SID to come up with a shared memory key, which is used at shared memory segment creation time, i.e., when the SGA is allocated. After that, further bequeath connections must have the same ORACLE_HOME and ORACLE_SID defined, so that they can derive the same key value and use it to attach to that existing SGA. If the ORACLE_HOME and/or ORACLE_SID is set incorrectly, the key value will be calculated incorrectly, and the server process will not be able to attach to the SGA shared memory segments.
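
To see the shared memory side from the OS, ipcs works on any Unix; Oracle also ships a small utility, sysresv (under $ORACLE_HOME/bin), that prints the IPC resources derived for the current ORACLE_HOME/ORACLE_SID. Treat the exact sysresv location as something to verify on your own install:

$ ipcs -m                     # list all shared memory segments on the host
$ $ORACLE_HOME/bin/sysresv    # show shared memory/semaphore IDs for this ORACLE_SID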


Problem:
Recently we hit the same error in our QA environment:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory


Root Cause:
The root cause was neither of the two reasons above; it came from the filesystem being remounted read-only:


[root@mvnowdb03 ~]# cat /var/log/messages
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: BMDMA stat 0x25
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: cmd 35/00:20:f8:18:3d/00:00:2d:00:00/e0 tag 0 dma 16384 out
Dec 15 10:03:14 mvnowdb03 kernel:          res 51/10:20:f8:18:3d/10:00:2d:00:00/e0 Emask 0x81 (invalid argument)
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: status: { DRDY ERR }
Dec 15 10:03:14 mvnowdb03 kernel: ata1.00: error: { IDNF }
Dec 15 10:03:15 mvnowdb03 kernel: ata1.00: configured for UDMA/133
Dec 15 10:03:15 mvnowdb03 kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Dec 15 10:03:15 mvnowdb03 kernel: sda: Current [descriptor]: sense key: Aborted Command
Dec 15 10:03:15 mvnowdb03 kernel:     Add. Sense: Recorded entity not found
Dec 15 10:03:15 mvnowdb03 kernel:
Dec 15 10:03:15 mvnowdb03 kernel: Descriptor sense data with sense descriptors (in hex):
Dec 15 10:03:15 mvnowdb03 kernel:         72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 15 10:03:16 mvnowdb03 kernel:         2d 3d 18 f8
Dec 15 10:03:16 mvnowdb03 kernel: end_request: I/O error, dev sda, sector 758978808
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429835
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429836
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429837
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: Buffer I/O error on device sda9, logical block 35429838
Dec 15 10:03:16 mvnowdb03 kernel: lost page write due to I/O error on sda9
Dec 15 10:03:16 mvnowdb03 kernel: ata1: EH complete
Dec 15 10:03:16 mvnowdb03 kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
Dec 15 10:03:16 mvnowdb03 kernel: sda: Write Protect is off
Dec 15 10:03:16 mvnowdb03 kernel: SCSI device sda: drive cache: write back
Dec 15 10:03:16 mvnowdb03 kernel: Aborting journal on device sda9.
Dec 15 10:03:16 mvnowdb03 kernel: ext3_abort called.
Dec 15 10:03:16 mvnowdb03 kernel: EXT3-fs error (device sda9): ext3_journal_start_sb: Detected aborted journal
Dec 15 10:03:16 mvnowdb03 kernel: Remounting filesystem read-only
Dec 15 10:03:16 mvnowdb03 kernel: __journal_remove_journal_head: freeing b_committed_data

Solution:
  1. One suggestion is to unmount the affected filesystem, run fsck on it, then remount (example commands below). After that, we need to restart the DB instances and app servers to rebuild their connection pools.
  2. Replace the dying (bad) disk
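
For option 1, the commands would be along these lines. sda9 is the affected partition from the log above; shut down the database first so the unmount can succeed, and only run fsck on an unmounted filesystem:

# umount /dev/sda9
# fsck -y /dev/sda9    # -y answers yes to all repair prompts
# mount /dev/sda9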

Wednesday, December 15, 2010

File format caused md5sum check failure

Background:
Every release we need MD5 checksums to ensure the released files are intact. md5sum is a useful tool for generating and verifying MD5 checksums.

Here are some examples:
C:\>md5sum --help
C:\>md5sum *.jar > test.md5
C:\>md5sum -c test.md5
release-part1.jar: OK
release-part2.jar: OK

Problem:
QA reported an issue when testing the md5 file, with the output below:
md5sum -c test.md5
: No such file or directory
: FAILED open or read
: No such file or directoryar
: FAILED open or read
md5sum: WARNING: 2 of 2 listed files could not be read

Root Cause:
The md5 file was in DOS format because I generated it with md5sum on my Windows desktop. Each line then ends in a carriage return, which md5sum on Linux treats as part of the file name, so when QA verified the file in a Linux environment, none of the names matched and we got the errors above.

Solution:
  1. Generate the md5 file on Linux instead of Windows
  2. Convert the generated md5 file from DOS to UNIX format (using UltraEdit or a similar editor, or the commands below)
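
For option 2, two command-line alternatives to an editor (test.md5 is the file generated above):

$ dos2unix test.md5           # if the dos2unix utility is installed
$ sed -i 's/\r$//' test.md5   # GNU sed: strip the trailing carriage returns

After the conversion, md5sum -c test.md5 reports OK for each file again.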