(March 31 2006)
I've recently been on a Solaris 10 course – I've been looking forward to finding out properly about all the new features that are now available. Let's face it, Solaris 9 is still basically the same 10/15-year-old Unix that we're all used to …
I have to be harsh. Everything is promising, the added features are integrated all the way to the middle of the kernel, there's strong evidence of good engineering input (as there is with most things from Sun that I've played with – note that I don't play with Java). But it's all half-baked, and seems to have been implemented in a complete vacuum; problems that have been experienced in the Linux world have been re-invented.
The changes are so big that you pretty much need exposure to them in order to do even basic system administration on a sol10 box, even if you were a competent sol9 admin. That sucks. What sucks worse is the quality of the changes themselves.
From what I've seen so far, if you're an OS hacker you might love the new stuff (especially DTrace, which … well, it rocks and sucks at the same time). But if you want to run a stable production environment, you can forget about Solaris 10. Better pray you can install Sol 9 on your new hardware for the next 5 years or so … or Linux. Any Linux. Even Mandriva …
OK, time to get specific :-
Zones seem to be useful – a bunch of virtual servers that have no ability to touch hardware, all using the same kernel but with their own independent filesystems.
Except for patching, which is ugly and difficult. You have to have all zones running (albeit perhaps only at single user) and the pkgadd utility effectively re-runs the pkgadd in each zone, sequentially. If you have some zones that are halted, patchadd will start them up, patch them, then shut them down. Hope you don't have too many!
Zones are created by copying the current state of the global zone – so be careful if you have modified global with things that should not be available to children. Basically, don't use your global zone once the machine has been installed, except for setting up the other zones. Well, that's not actually a problem, as long as you know about it ;-)
On the positive side, the Fair Share Scheduling works wonderfully across zones, and is a joy to use. Except that the global zone only gets 1 share by default, and you have to write your own start-up script to change this. Perhaps the root .profile should set up nn shares when you login, and revert when you log out? :-)
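To see why that single default share matters, here's a minimal sketch of the arithmetic behind fair sharing: when every zone is busy, each gets CPU in proportion to its shares out of the total. The zone names and share counts are made up for illustration, and this is only the idea, not the scheduler's actual code.

```python
def cpu_entitlement(shares):
    """Return each zone's CPU fraction when every zone listed is busy.

    Fair sharing only compares zones that are competing for CPU,
    so pass in just the active ones.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one zone needs a positive share count")
    return {zone: count / total for zone, count in shares.items()}


# Hypothetical box: two zones given 20 shares each,
# with the global zone left at its default of 1 share.
zones = {"global": 1, "web": 20, "db": 20}
for zone, fraction in cpu_entitlement(zones).items():
    print(f"{zone:8s} {fraction:6.1%}")
```

With numbers like these the global zone is entitled to 1/41 of the CPU whenever the other zones are busy, which is not what you want when you're logged in there trying to fix something.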
The maximum length of passwords has finally increased beyond 8. Well, if you ask, that is. You can set the maximum password length up to 256. I don't know why you would set it to less than the maximum – but it appears that the Solaris implementation of DES is limited to 8 chars. I guess you'd get problems if a Sol10 machine set a password value in your NIS, and defaulted to MD5/long password, while you still have Sol9 machines accessing it …
Least Privilege is great – assuming that you hanker for the days of VMS. Actually, it's an interesting way to let non-root users have the minimum abilities to do things.
Take ping for example – you remembered that this command is suid root, didn't you? Well, it's that way because standard users can't send/receive the ICMP ping packets (except I thought you could go raw and do it that way … never mind). Sol10's Least Privilege means you can sit back, give your users a non-suid copy of ping, and grant them the net_icmpaccess privilege … and ping will work!
Stunning integration, down to the kernel. Great engineering skills – if you don't have a priv, you'll get a decent user-level error message telling you about the problem – and the ppriv command will enable the user to discover exactly what they're missing!
But you grant privs per user, and although you can do them on the fly, basically they're permanent. You can't just grant net_icmpaccess for /usr/sbin/ping, it has to be for the whole user. You want to manage these privs? They're effectively ACLs, and the one thing you should know about ACL schemes is that you need a f#$%=#ing good management interface, or else the whole thing goes to pot. And you don't get a management interface on sol10.
I have the feeling that “process projects” might help, but haven't had time to look into those yet.
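The per-user problem is easier to see as set arithmetic. This is a toy model only: the five "basic" names are the Solaris 10 basic privilege set as I understand it, while the user and the program names other than ping are hypothetical.

```python
# Solaris 10's "basic" set, which every ordinary user starts with.
BASIC = frozenset(
    {"file_link_any", "proc_exec", "proc_fork", "proc_info", "proc_session"}
)


def missing(user_privs, needed):
    """List the privileges a program needs that the user does not hold."""
    return sorted(set(needed) - set(user_privs))


def can_run(user_privs, needed):
    """The check only looks at the user's set, never at which program asks."""
    return not missing(user_privs, needed)


# Hypothetical programs and the privileges they would need.
PROGRAMS = {
    "ping": {"net_icmpaccess"},
    "homebrew_icmp_tool": {"net_icmpaccess"},
    "raw_sniffer": {"net_rawaccess"},
}

# Hypothetical user who was granted net_icmpaccess so that ping works.
alice = BASIC | {"net_icmpaccess"}

for name, needed in PROGRAMS.items():
    verdict = "ok" if can_run(alice, needed) else "missing " + ", ".join(missing(alice, needed))
    print(f"{name:20s} {verdict}")
```

The grant that was meant for ping also lets the user's own ICMP tool through, because nothing in the check mentions the executable.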
UFS will now go happily up to multiple terabytes. I guess that's cool. MTBUFS now allocates just one inode per megabyte of filesystem. That just makes me smile.
Naturally, these changes mean that MTBUFS cannot be used by 32bit machines, or Sol9. No great loss – this is a period of change, so think carefully before you convert your filesystems :-)
But don't run fsck – pray that your journalling works. fsck is now documented as taking between 4 days and a week to run on a large filesystem. A corrupt filesystem is now a reason to invoke full disaster recovery. You have tested your DR recently, haven't you?
The zettabyte file system. Details have been removed from the official training materials. Insufficient testing – the words “filesystem bugs” made too many people run away screaming, and rightfully so!
The fault management framework doesn't really apply to you, unless you have a big big machine (well, I have a login on a couple of E25Ks, so I guess I qualify!)
It will collect transient error reports from your hardware, and uses hardcoded tolerances to decide when to panic. It will take an item of hardware off-line if it can find an alternative to run your system on (allegedly even one half of a DIMM!) – but it won't run any tests over it. Well, I guess a production system isn't the right place to do any sort of testing – just yank the board and replace it. You have support from Sun for all that testing crap, don't you?
So I think I like the fault management stuff, even if I have no control over it – because on machines of that size, I don't want to get involved.
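For a feel of what a "hardcoded tolerance" could look like, here's a rough sketch of an "N errors within T seconds" rule. The class name and the thresholds are hypothetical, and this is an illustration of the general idea rather than a description of what the fault manager really runs.

```python
from collections import deque


class ErrorRateTrigger:
    """Fire once `limit` errors have been seen within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = deque()

    def record(self, timestamp):
        """Record one error; timestamps must not go backwards.

        Returns True when the component should be treated as faulty.
        """
        self.events.append(timestamp)
        # Forget errors that have fallen out of the time window.
        while timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.limit


# Hypothetical tolerance: 3 correctable errors inside 60 seconds.
trigger = ErrorRateTrigger(limit=3, window=60)
for when in (0, 10, 100, 120, 150):
    print(when, trigger.record(when))
```

A couple of isolated errors are forgotten once they age out of the window; only a burst trips the trigger.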
Holy crap, where's =init.d=‽‽
It's still there, gentle student – but it's now invoked by a completely different process. Now, instead of your system starting up sequentially, via runlevels and a whole bunch of little shell scripts, you get a parallelised, optimised, dependency-aware meta-server, that will actively keep a service running by restarting if it dies.
As a downside, all the config is controlled by multiple XML files, read by a parser that seems to silently die if there's extra whitespace in there (syntactically valid whitespace, by the way) – and the whole lot is compiled into a single binary database that is so damn flakey that you can't even take a copy while your OS is running!
Microsoft called it the registry. What's Sun's excuse?
Actually, you can copy the repository.db file - you have to halt the service daemon, then kill the config daemon … then copy the file, and unhalt the service daemon, which will respawn the config daemon immediately. Yeah, right.
And your startup scripts themselves? They're logged all over the place: stdout goes into one log file, you may want to manually send output to /dev/console, and if a service fails to start, the svcadm command WILL NOT TELL YOU even though it knows the failure has occurred.
You can't look at $?, you have to explicitly look at the service status to find out if it's started. And hope you haven't confused a broken startup with one that was simply slow to start (like an fsck on an MTBUFS filesystem, eh?).
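Since the exit code is no help, you end up writing something like the following yourself. It's a minimal sketch: it assumes the state names that svcs prints (online, offline, maintenance and so on) and the trailing asterisk svcs adds while a service is changing state. The helper names and the verdict wording are my own, and service_state only works on a box that actually has SMF.

```python
import subprocess

VERDICTS = {
    "online": "running",
    "degraded": "running, but degraded",
    "maintenance": "broken, needs an admin",
    "disabled": "switched off",
    "offline": "not up yet (slow start or unmet dependency)",
    "uninitialized": "not up yet (slow start or unmet dependency)",
    "legacy_run": "old-style init script was run",
}


def classify(state):
    """Turn the STATE column from svcs into a plain-English verdict."""
    state = state.strip()
    if state.endswith("*"):
        # svcs appends '*' while an instance is moving between states.
        return "in transition, ask again later"
    return VERDICTS.get(state, "unknown state: " + state)


def service_state(fmri):
    """Ask svcs for one service's state (needs a machine with SMF)."""
    result = subprocess.run(
        ["svcs", "-H", "-o", "state", fmri],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Sample state strings, so the sketch runs anywhere.
for sample in ("online", "offline*", "maintenance"):
    print(sample, "->", classify(sample))
```

The useful distinction is between "maintenance", which will not fix itself, and anything offline or in transition, which might just be slow.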
How often do you reboot your production Solaris machines? All we really needed was a facility to monitor and re-run daemons that died – and that could have been done by a simple subscribe flag in the existing startup scripts, or by hooking into the system monitoring facilities. For many sites, that's already true. Remind me why we needed SMF, will you?
Aah, DTrace. It's astounding just how much detail you can extract from the running kernel. Really, really fantastically wonderful levels of detail. Stunning stuff – we'll all have to start reading the source again to make sense of it! (and hasn't opensolaris.org got some excellent source-code viewing? I want some of the OpenGrok stuff for my old perl scripts!!)
I'm literally blown away by how damn easy it is to extract real-time, low-down internal stats from the machine. To watch every filesystem read operation, graphing buffer size against time to execute, against physical location on the disk? Amazing!!!
I'm literally blown away by how crappy and brain-dead the D language is, that implements all your probes.
I'll accept that the problem domain for this new language isn't straightforward – it has to exist in a multi-threaded world, compile up to run at as low an overhead as possible, to be data-safe …
But all these are problems that have already been addressed in the real world, and solved. D has invented a bastard mix between a crippled awk and what might once have been C … and the name “D” has already been used! Wikipedia lists six programming languages called D already, and hasn't even heard of Filetab-D!
I suspect that the genius who engineered DTrace really really knew his kernel internals backwards, upside-down, reversed, in black-and-white, standing on his head … but whoever came up with D hasn't seen anything except Sun code for the last 20 years. Except Java. Which probably explains something.
It's at this point that someone reveals the real history of DTrace; I know, I've already read up on it. And everything I've said is wrong, isn't it? The people who made DTrace are pretty cool, aren't they? :-)
BUT D IS STILL CRAPPY AND UGLY
Please, someone get DTrace working with a decent front-end language!!! And NOT JAVA!!!
As I suspected, D is deliberately lacking in programming features to make it safe – after all, it's frobbing with your kernel and data structures. And you're probably running it on a production box, so the footprint of inspection needs to be tiny … it's a very very difficult problem domain, I know I couldn't get within even a fraction of a percentage point of a real solution.
Small DTrace scripts completely outstrip huge problem-specific binaries in their ability to expose useful and accurate internal system data. Hey – we know lsof shows all open files, right? How about watching every read and write operation in real-time across your machine?
dtrace -n 'io:::start { printf("%s %s %s\n", execname,
    args[0]->b_flags & B_READ ? "reads" : "writes",
    args[2]->fi_pathname); }'
See figure 4 on this ACM DTrace article for the output :-) I don't have a Sol10 machine online to show you …
So, we now have a reasonable specification for setting up Quality of Service on IP, using the IP Generic Packet Classifier and the Differentiated Services Code Point Marker. Some of the ipgpc specs are nice, allowing QoS to be set on a userlevel basis, not just on the traditional sport/dport.
However, QoS doesn't play nicely with MultiData Transmission (which mashes up multiple small packets into a bigger one for transmission – like the old mailing list digest posts). So don't be tempted to declare a QoS for some bulk data transfer that might secretly already be using MDT :-)
I'm not quite sure yet how sol10 decides to use MDT for a connection (and more importantly, how it decides to not use it), but it seems to be fine!
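Going back to the Differentiated Services Code Point for a moment: it lives in the top six bits of what used to be the IP TOS byte, so converting between the two is a two-bit shift. A minimal sketch, with function names of my own choosing:

```python
def dscp_to_tos(dscp):
    """Return the TOS byte value that carries the given 6-bit DSCP."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value, 0 to 63")
    return dscp << 2


def tos_to_dscp(tos):
    """Extract the DSCP from a TOS byte, dropping the low two bits."""
    return (tos & 0xFF) >> 2


EF = 46    # Expedited Forwarding code point
AF11 = 10  # Assured Forwarding class 1, low drop precedence

print(hex(dscp_to_tos(EF)))    # 0xb8
print(hex(dscp_to_tos(AF11)))  # 0x28
```

That is why a marker configured for code point 46 shows up as 0xb8 when you look at the raw header byte in a packet trace.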
NFS version 4 seems to be a damn sight more efficient – it doesn't have to go via rpc, the protocol on-the-wire is significantly tighter, and the pseudo-file system hides non-explicitly exported files far better.
The ‘write delegation’ allows a client to make writes to a file without necessarily sending every change back to the server – if another client wants to write, the server will recall the delegation and cause the original client to flush immediately. Excellent stuff!
Of course, if you attempt to write to a delegated file without using NFS, you stand a good chance of corrupting it. So don't go touching your exported files on the server directly! Plus delegation is on by default. Caveat depascor!
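Here's a toy model of the behaviour described above, to show both the recall and the way a local edit can get lost. It is nothing like the real protocol: the classes, the whole-file caching and the method names are all assumptions made for illustration.

```python
class Server:
    """Holds one file and tracks which client has the write delegation."""

    def __init__(self, contents=""):
        self.contents = contents
        self.holder = None

    def grant_write(self, client):
        """Hand the delegation to `client`, recalling it from anyone else."""
        if self.holder is not None and self.holder is not client:
            self.holder.flush(self)  # recall: the old holder writes back now
        self.holder = client
        client.cached = self.contents

    def edit_locally(self, contents):
        """Someone edits the file on the server itself, bypassing NFS."""
        self.contents = contents  # in this model, no recall happens


class Client:
    """Caches the whole file while it holds the delegation."""

    def __init__(self):
        self.cached = None

    def write(self, contents):
        self.cached = contents  # stays local until flushed

    def flush(self, server):
        server.contents = self.cached


server, a, b = Server("v1"), Client(), Client()
server.grant_write(a)
a.write("v2 from a")
print(server.contents)   # still the old contents
server.grant_write(b)    # recall forces a to flush
print(server.contents)

server.edit_locally("edited on the server")
b.write("v3 from b")
b.flush(server)
print(server.contents)   # the local edit has been overwritten
```

The second client's open goes through the server, so the first client is made to flush. The local edit goes round the side, so nobody is told and the next flush tramples it.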
This article probably wasn't very widely distributed, but did find its way onto the OpenSolaris discussion list. Here's the "discussion"
There are valuable and interesting technical clarifications and refutations in that discussion, so it's worth reading.