<?xml version="1.0"?>
<rss version="2.0">
<channel>
  <title>Coding the Architecture - rannett</title>
  <link>http://www.codingthearchitecture.com/authors/rannett/</link>
  <description>Software architecture for developers</description>
  <language>en</language>
  <copyright>Coding the Architecture</copyright>
  <lastBuildDate>Mon, 09 Jan 2012 09:02:08 GMT</lastBuildDate>
  <generator>Pebble (http://pebble.sourceforge.net)</generator>
  <docs>http://backend.userland.com/rss</docs>
  
  
  <item>
    <title>Features vs Behaviour</title>
    <link>http://www.codingthearchitecture.com/2011/11/25/features_vs_behaviour.html</link>
    
      
        <description>
          &lt;p&gt;
I&#039;ve recently had a bug I raised with a third party software supplier downgraded from high to low importance. No one likes having their bugs downgraded (it probably shows you what a nerd I am by taking this personally) but what surprised me was the reason. The bug was causing lots of misleading errors to be reported but the bug was deemed to not affect core &#039;functionality&#039; as the feature worked from an end user&#039;s perspective. However it has a negative effect on our ability to operate the software system.
&lt;p&gt;
This seems to be one of the big differences between software developers and software architects. Software developers/programmers think in terms of features and a feature tends to be defined in terms of cause and effect e.g. a user clicks on a button and the system responds. A defect or bug is simply when the system does not give the response as defined by the specification.
&lt;p&gt;
An architect should consider the holistic behaviour: not only the resultant action but the complete behaviour and life-cycle of the action, from simple (and measurable) items like timings (latency, response etc) to more complex system behaviour such as auditing, logging, replication etc.
&lt;p&gt;
Most software development processes revolve around features. Therefore when a bug is raised it HAS to be registered against a feature. This is then passed to a developer who either rejects it as &#039;working&#039; or downgrades it as not being core or having a &#039;work around&#039;. However the work around might be unacceptable such as waiting longer, bad logging/auditing or a side effect on a completely different part of the system.
&lt;p&gt;
In my experience it is rare for a development process to consider system or architecture issues.
&lt;p&gt;
Does your software development process allow you to raise and track non-functional issues and how do you do this? Do tools (such as JIRA) help or hinder with this? These issues will cut across many features - should they be raised against a set of features or have a single bucket that they get put in? Most importantly, how can I get my bug upgraded when they have a feature-based bug reporting process?!
&lt;p&gt;
As a side note it appears to me that Simon Brown&#039;s and Robert Martin&#039;s &lt;a href=&#034;http://www.infoq.com/news/2011/11/Debate-The-Annoying-Detail&#034;&gt;debate&lt;/a&gt; is partially about the difference between product features and system behaviour.
&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>What is software architecture?</category>
    
    <category>How do you deliver software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2011/11/25/features_vs_behaviour.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2011/11/25/features_vs_behaviour.html</guid>
    <pubDate>Fri, 25 Nov 2011 22:13:00 GMT</pubDate>
  </item>
  
  <item>
    <title>Is caching an &#039;Architectural Smell&#039;?</title>
    <link>http://www.codingthearchitecture.com/2011/10/02/is_caching_an_architectural_smell.html</link>
    
      
        <description>
          &lt;p&gt;Kent Beck introduced the concept of &#034;Code Smells&#034; while working on Martin Fowler&#039;s famous &lt;a href=&#034;http://martinfowler.com/bliki/CodeSmell.html&#034;&gt;Refactoring&lt;/a&gt; book and I think that most people would agree with many of the stinks he identified. Many of us probably also use tools such as checkstyle to automatically identify such things as excessively long methods, dead code etc. To those not familiar with the concept please have a quick read from the link above but the basic premise is that
&lt;/p&gt;

&lt;blockquote&gt; A code smell is a surface indication that usually corresponds to a deeper problem in the system.&lt;/blockquote&gt;

&lt;p&gt;
Though we have to remember that just because some code has a &#039;smell&#039; doesn&#039;t mean it&#039;s bad, just that it&#039;s worth investigation and justification.
&lt;/p&gt;
&lt;p&gt;
We can take the concept to the next layer of abstraction and identify a number of &#034;Architectural Smells&#034;. A recent &lt;a href=&#034;http://www.enigmastation.com/2011/09/20/caches-and-what-they-mean-for-the-cloud&#034;&gt;blog article&lt;/a&gt; touched upon one of mine - the (over) use of Caches.
&lt;/p&gt;
&lt;p&gt;
I&#039;ve had terrible trouble with caches in the past. They can introduce bugs which are difficult to reproduce as they rely upon operation timing to be visible. They are similar to bugs you find in concurrent systems, where the issue only occurs every few thousand operations and isn&#039;t present when you attach a debugger or logging. Like all performance tuning a cache should be introduced AFTER you have determined that there is a problem. However they can be added so easily that developers throw them in whenever they can. Of course if your cache hit rate is low then your performance can actually degrade after adding a cache.
&lt;/p&gt;
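&lt;p&gt;
As a rough illustration of that last point, here is a minimal Python sketch of the average read time with and without a cache. It assumes a simple look-aside cache where every read pays for the cache lookup and every miss also pays for the full read from the source; the function name and the timings are invented for the example and are not measurements from a real system.
&lt;/p&gt;

```python
def average_read_ms(hit_rate, cache_ms, source_ms):
    """Expected read time for a look-aside cache (illustrative model only).

    Every read pays the cache lookup; a miss also pays the source read.
    """
    miss_rate = 1.0 - hit_rate
    return cache_ms + miss_rate * source_ms


source_ms = 10.0  # assumed cost of reading the system of record
cache_ms = 2.0    # assumed cost of a cache lookup, e.g. a remote cache

for hit_rate in (0.9, 0.5, 0.1):
    cached = average_read_ms(hit_rate, cache_ms, source_ms)
    print(f"hit rate {hit_rate:.0%}: {cached:.1f}ms with cache, {source_ms:.1f}ms without")
```

&lt;p&gt;
Under these assumed numbers the cache only pays for itself once the hit rate is above 20%; below that, every read is slower than going straight to the source.
&lt;/p&gt;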
&lt;p&gt;
Maybe you agree or not with the above (and I know I&#039;ll be flamed for saying it) but why do I consider caches to be an Architectural Smell?
&lt;/p&gt;
&lt;p&gt;
In a perfect system the business logic will always have access to the data it needs. The access (local or remote) will fit comfortably into the non-functional requirements and the data it uses will be from the primary source/system of record and not be stale.
&lt;/p&gt;
&lt;p&gt;
Back in the real world the system is not used in the way it was originally designed for, it is used by many more users than anticipated, and they can&#039;t wait for anything.
&lt;/p&gt;
&lt;p&gt;
The temptation is to introduce a cache at each layer where there is an issue. They can be very easy to introduce (Spring will allow you to do this with a couple of lines of configuration for your data access components) and the user&#039;s perception of response can improve dramatically. Is it a free lunch? If you look closely at the options available with caching systems you&#039;ll see all sorts that you might associate with databases - which is not surprising as they are really a mini database. Have you considered data staleness, dirty reads, dirty writes, update schedules? Will all clients of the data see the same data at the same time? Can updates be missed? Does it listen for updates or poll? Is data coalesced, grouped or skipped? Depending on the use of the data you might answer these questions and decide that caching is an effective and accurate solution - great! If it&#039;s not then the cache will introduce the kind of bugs I described.
&lt;/p&gt;
&lt;p&gt;
Either way it is still an Architectural Smell. Perhaps the best solution is to re-examine how data is distributed and accessed throughout the system. For example:
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt; Maybe a monolithic database sitting at the center of the system isn&#039;t the best solution and perhaps you need multiple databases with different responsibilities? (Issues with monolithic, remote databases are a common reason for needing caches).
&lt;/li&gt;
&lt;li&gt;Maybe an asynchronous messaging system with multiple messages being processed would work better than a single request/response system?
&lt;/li&gt;
&lt;li&gt;Perhaps data associated with a request should be sent through the system with the request itself (enriched request).
&lt;/li&gt; 
&lt;li&gt;Should some data (e.g. static) be explicitly kept locally rather than requested and cached? 
&lt;/li&gt;
&lt;li&gt;Should some data have its encoding changed? Moving from/to XML is very time consuming.
&lt;/li&gt;
&lt;li&gt;Can data be requested in larger or smaller blocks to reduce overheads? Calling a database in a loop is a common problem.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
I appreciate that this will involve a lot more work than a few lines of configuration but it may help architectures to evolve logically rather than become a series of hacks and bolt-ons. Introducing a cache is an architectural decision and not a coding one.
&lt;/p&gt;
&lt;p&gt;
What are your favourite Architectural Smells we should all look for? I&#039;ve already mentioned another of mine - &#034;XML everywhere&#034;.
&lt;/p&gt;


        </description>
      
      
    
    
    
    <category>What is software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2011/10/02/is_caching_an_architectural_smell.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2011/10/02/is_caching_an_architectural_smell.html</guid>
    <pubDate>Sun, 02 Oct 2011 19:24:00 GMT</pubDate>
  </item>
  
  <item>
    <title>Jitter</title>
    <link>http://www.codingthearchitecture.com/2011/09/06/jitter.html</link>
    
      
        <description>
          &lt;p&gt;
On CTA we often talk about non-functional requirements and how these can drive the architecture of a system. Most of these cover issues of desired response time and capacity (latency, throughput, storage etc) but I believe that Jitter is a metric which is either forgotten or unknown to some software engineers – even though it&#039;s essential for hardware engineers.
&lt;/p&gt;
&lt;p&gt;
The most basic definition would be “variation in response”. In other words, the response, transmission or latency in a system will not be a constant. In many systems this variation goes unnoticed as it is small compared to the response itself. However I believe that as the systems we deal with become more performant and demanding, software engineers will need to understand, measure and tune this.
&lt;/p&gt;
&lt;p&gt;
A quick web-search of jitter shows the term being used to cover performance degradations due to activities such as garbage collection or unexpected user actions. This seems to greatly annoy telecommunications and hardware engineers who would argue that these are predictable, system and user events which could just be turned off – although ignoring your users and asking for a machine with infinite memory may get you fired. They would argue that true Jitter is an unpredictable variation in response whose occurrence frequency follows a normal distribution. Like most effects that are normally distributed it is caused by cumulative random events, most of which are due to the switching actions. Personally, I think you should measure whatever makes sense in your system.
&lt;/p&gt;
&lt;p&gt;
Jitter means that hard limits for latency can be statistically likely but not guaranteed. Specifications such as:
&lt;/p&gt;
&lt;blockquote&gt;
“The system should respond to action X within 200ms”.
&lt;/blockquote&gt;
&lt;p&gt;
should be challenged and replaced with statements like:
&lt;/p&gt;
&lt;blockquote&gt;
“The system should respond to action X within 200ms for 95% of requests”
&lt;/blockquote&gt;
&lt;p&gt;
Of course we are making the implicit assumption that this is normally distributed and the system WILL, eventually, respond. You might want to explicitly state that all actions will be executed but stating a hard limit for all responses means that if you measure for long enough, you&#039;ll break your specification.
&lt;/p&gt;
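&lt;p&gt;
A percentile-style specification like this is easy to check against measured response times. The following is a minimal Python sketch using the nearest-rank percentile; the function names and the sample timings are invented for illustration.
&lt;/p&gt;

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile: the smallest sample with at least
    `percent` percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def meets_spec(latencies_ms, limit_ms=200, percent=95):
    """True if `percent` percent of responses arrive within `limit_ms`."""
    observed = percentile(latencies_ms, percent)
    # max() equals the limit only when the observed value does not exceed it.
    return max(observed, limit_ms) == limit_ms


# Made-up sample: nineteen quick responses and one slow outlier.
sample = [120] * 10 + [150] * 6 + [180] * 3 + [450]

print("95% within 200ms: ", meets_spec(sample, 200, 95))
print("100% within 200ms:", meets_spec(sample, 200, 100))
```

&lt;p&gt;
With this sample the 95% specification holds while the hard limit is broken by a single outlier, which is exactly the trap described above.
&lt;/p&gt;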
&lt;p&gt;
Jitter can be quite obvious on messaging based systems that contain many hops. This will also tend to have a nice, Gaussian distribution due to the cumulative random delays.
&lt;/p&gt;
&lt;p&gt;
The best way to measure Jitter is to set up a standard test where you can sequentially fire a large number of requests into your system and measure the response time. However, rather than simply add these up and find an average we need to count how many responses are in a set of timing windows e.g. how many responses fall between 5ms-10ms, 10ms-15ms, 15ms-20ms etc. If we plot the number in each section we should see our expected Normal distribution.
&lt;/p&gt;
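&lt;p&gt;
The counting step can be sketched in a few lines of Python. This is only an outline of the idea: the function names, the 5ms window width and the sample timings are assumptions for the example, and the real timings would come from your own test harness.
&lt;/p&gt;

```python
from collections import Counter


def bucket_counts(latencies_ms, width_ms=5):
    """Count responses per timing window, keyed by the window's lower bound."""
    counts = Counter()
    for latency in latencies_ms:
        lower = int(latency // width_ms) * width_ms
        counts[lower] += 1
    return dict(sorted(counts.items()))


def print_histogram(counts, width_ms=5):
    """Print one text bar per timing window."""
    for lower, count in counts.items():
        print(f"{lower:3d}-{lower + width_ms:3d}ms | {'#' * count}")


# Made-up response times in milliseconds.
sample = [12.1, 13.4, 9.8, 14.9, 11.2, 16.0, 12.7, 18.3, 13.0, 7.5]
print_histogram(bucket_counts(sample))
```

&lt;p&gt;
The width of the bars is the thing to watch: a narrower profile means less Jitter, wherever the peak sits.
&lt;/p&gt;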

&lt;p&gt;
Reducing Jitter is subtly different from reducing latency itself. You are trying to reduce the variation in the response times or &#039;squeeze&#039; the response profile.
&lt;/p&gt;
&lt;p&gt;
Often when you reduce latency you will reduce the Jitter proportionally; however, this is not always the case. For example, network hops are a common cause of latency but each hop will increase the Jitter as well – you are accumulating more random delays with each hop. If you change the physical architecture of your system to have more but quicker hops you will almost certainly increase the jitter (remember this is the variation in response and not the actual response) even if the average latency is lower. It IS important to measure and monitor it as the system evolves.
&lt;/p&gt;
&lt;p&gt;
Simon is covering a number of related issues in his Skills Matter tutorial
&lt;a href=&#034;http://www.codingthearchitecture.com/2011/08/25/load_testing_for_developers.html&#034;&gt;Load Testing for Developers&lt;/a&gt;.

&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>How do you define software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2011/09/06/jitter.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2011/09/06/jitter.html</guid>
    <pubDate>Tue, 06 Sep 2011 08:02:41 GMT</pubDate>
  </item>
  
  <item>
    <title>Data Integrity and System Design</title>
    <link>http://www.codingthearchitecture.com/2011/04/15/data_integrity_and_system_design.html</link>
    
      
        <description>
          &lt;p&gt;Having been involved in several upgrade projects over the last few years, one thing I&#039;ve often noticed is the poor quality of data that can be present in a large and long running system. This can present problems for upgrading and usually means that you have to spend quite some time fixing the data first.
 
&lt;p&gt;Upgrading is difficult and causes regression tests to fail as:
 
&lt;li&gt; The new system may have more data checking and refuse the old data.
&lt;li&gt; The new system may be more precise e.g. not rounding a number, taking a sign into account etc.
&lt;li&gt; The new system may be using data that was not used before e.g. data entry staff not bothering to enter a product&#039;s weight into a form as &#039;it is never used&#039;.
&lt;li&gt; Years of copy and paste can leave a vast amount of junk that fails consistency checks.
 
&lt;p&gt;After you have corrected the data for upgrade, the original system has much higher quality data and other issues and inconsistencies have been solved. In a recent system we also saw large performance improvements due to duplicate and junk data being removed. On another system we saved the operations staff many hours of work a week as the data improvements meant a large number of post report corrections were no longer needed.
 
&lt;p&gt;So why isn&#039;t this analysis done on a regular basis to help keep a system healthy? The main reason is simply that it&#039;s just too hard for the operations staff to do. Therefore when you&#039;re designing a system you should take this into account and enable these kinds of maintenance tasks. This involves reporting and having tools that can correct sets of problematic data.
 
&lt;p&gt;Some things to consider:
 
&lt;li&gt; How easy is it to identify and delete orphaned data i.e. if you can&#039;t get to some data is it required?
&lt;li&gt; Can a user identify data that has not been used for a long time? Can they then archive it?
&lt;li&gt; Can you identify identical or similar data? A common example is user information that differs only by capitalisation e.g. an address.
&lt;li&gt; Can the user run arbitrary consistency checks that go beyond the database rules? E.g. I&#039;ve recently written a tool to allow an operations manager to run xpaths over data to check for bad bookings.
&lt;li&gt; Can the user bulk load sets of missing or corrected data?
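&lt;p&gt;As an illustration of the &#039;identical or similar data&#039; check, here is a minimal Python sketch that groups records whose field differs only by capitalisation or spacing. The record layout (dictionaries with an id and an address) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivially different values compare equal."""
    return " ".join(text.lower().split())


def find_near_duplicates(records, field):
    """Return groups of record ids whose `field` differs only by case or spacing."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise(record[field])].append(record["id"])
    return [ids for ids in groups.values() if len(ids) != 1]


# Hypothetical customer records.
customers = [
    {"id": 1, "address": "10 High Street"},
    {"id": 2, "address": "10 HIGH  street"},
    {"id": 3, "address": "12 High Street"},
]
print(find_near_duplicates(customers, "address"))
```

&lt;p&gt;A real tool would need smarter matching (abbreviations, punctuation, misspellings) and should only report the candidates, leaving the decision to merge or delete to someone who understands the data.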
 
&lt;p&gt;Please don&#039;t rely on database tools to do this as your operational staff probably won&#039;t know how to use them and your DBAs won&#039;t understand the business domain well enough to analyse the data. You need tools at the appropriate level for the appropriate people, and you need to consider the complete lifecycle of your product.
        </description>
      
      
    
    
    
    <category>What is the the role of a software architect?</category>
    
    <comments>http://www.codingthearchitecture.com/2011/04/15/data_integrity_and_system_design.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2011/04/15/data_integrity_and_system_design.html</guid>
    <pubDate>Fri, 15 Apr 2011 14:29:43 GMT</pubDate>
  </item>
  
  <item>
    <title>Maintainable Systems 2</title>
    <link>http://www.codingthearchitecture.com/2010/09/29/maintainable_systems_2.html</link>
    
      
        <description>
          &lt;p&gt;
A little while ago I wrote a piece about &lt;a href=&#034;http://www.codingthearchitecture.com/2010/01/29/designing_maintainable_systems.html&#034;&gt;maintainable systems and upgrades&lt;/a&gt;. My own upgrade project has been progressing slowly and I&#039;m going to write a few more thoughts.
&lt;/p&gt;&lt;p&gt;
A successful system might (hopefully) be around for many years. It&#039;s highly likely that such a system is going to need upgrading periodically. The user&#039;s requirements might change over time, different users might want to use the system, external interfaces may change and, of course, software salespeople will want a reason to charge upgrade fees (I&#039;m such a cynic).
&lt;/p&gt;&lt;p&gt;
If you intend your system to be successful then your design should allow for upgrades. I&#039;m currently involved in an upgrade project and it&#039;s painful. I also have to admit that when designing systems from scratch I&#039;ve made many of the same mistakes and probably made the user&#039;s life difficult. So here are some of my recent problems that I should remember when I&#039;m back on the other side.
&lt;/p&gt;&lt;p&gt;
It&#039;s not a clean install so there is going to be data. Can you migrate all the data in the system? This seems obvious but be careful with accuracy of numbers, text data in other languages and empty and null fields. Consider what &lt;i&gt;could&lt;/i&gt; be there and not just what you think &lt;i&gt;should&lt;/i&gt; be there. Don&#039;t be surprised if a free text field, where you were expecting a number, has the word “five” - or in my case 3M for 3,000,000.
&lt;/p&gt;&lt;p&gt; 
How long will the upgrade take to perform - in particular per data item? Your users may have been running the system for years and have a lot of data. If they are using it in a way that differs from your expectations then they may have a large number of items where you expect a few. In particular make sure that there are no manual steps per data item e.g. having to edit something in a GUI. If it takes 5 minutes to do but I have 5000 items then it would take weeks.
&lt;/p&gt;&lt;p&gt; 
Have you changed an implicit API? Common examples include log files and database schemas. I like to use a log monitoring tool to find exceptions but it is also common to trigger external processes from a log indicating a certain point has been reached. Your end users may have also written SQL scripts to interrogate your back-end database and produce reports. Don&#039;t blame your users/customers for doing this as it probably indicates a gap in your product!
&lt;/p&gt;&lt;p&gt; 
Be careful when you &#039;improve&#039; a GUI. It may be much better but the end users will now need to be re-trained. Like the data migration problem this can be a problem of scale. Telling a system administrator where to enter new user data is not hard but if it&#039;s the interface to a Point Of Sale terminal used by 20,000 shop assistants then the costs are high. Having said this I do believe that if you do make a change then you should stick to it – leaving old, deprecated screens as well as the new ones leads to a support and training nightmare. The current product I&#039;m upgrading has four different GUI screens for one type of data element because different customers have either demanded a new screen or refused to stop using an old one. They all work differently and have different bugs in...
&lt;/p&gt;&lt;p&gt; 
Does a change in software process imply a change in the business process? Your users/customers may have adapted the way they work around your software so any improvements you make could force other changes. This could meet with resistance.
&lt;/p&gt;&lt;p&gt; 
Lastly, please consider the final stage of the upgrade process – testing! You&#039;ve probably/hopefully performed enough tests so you&#039;re convinced it works in the way you expect but your customers may have different expectations. They will almost certainly want to perform their own tests. My current project includes a third party finance application. What I want to do is produce reports for the same point in time in the old and the new systems and compare the totals and individual line items between the reports. If I can&#039;t generate comparable reports then I&#039;d have to do this comparison by checking the individual lines. If I do it manually I&#039;ll get a much lower level of coverage and the edge cases might be missed. If I can&#039;t do the comparisons at a fine grain level then I can&#039;t track down where any problems are occurring.
&lt;/p&gt;&lt;p&gt;
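The kind of report comparison described above could be sketched in Python along these lines. It assumes each report can be exported as a mapping of line identifier to amount; the function names, the tolerance and the sample figures are all invented for the example.
&lt;/p&gt;

```python
import math


def compare_reports(old, new, tolerance=0.005):
    """Compare two reports given as {line_id: amount} and list the differences."""
    differences = []
    for line_id in sorted(set(old) | set(new)):
        if line_id not in old:
            differences.append((line_id, "only in new", None, new[line_id]))
        elif line_id not in new:
            differences.append((line_id, "only in old", old[line_id], None))
        elif not math.isclose(old[line_id], new[line_id], rel_tol=0.0, abs_tol=tolerance):
            differences.append((line_id, "amount differs", old[line_id], new[line_id]))
    return differences


def totals_match(old, new, tolerance=0.005):
    """True if the report totals agree to within the tolerance."""
    return math.isclose(sum(old.values()), sum(new.values()), rel_tol=0.0, abs_tol=tolerance)


# Hypothetical line items from the old and the upgraded system.
old_report = {"INV-1": 100.0, "INV-2": 250.5, "INV-3": 75.0}
new_report = {"INV-1": 100.0, "INV-2": 250.0, "INV-4": 75.5}

print("totals match:", totals_match(old_report, new_report))
for difference in compare_reports(old_report, new_report):
    print(difference)
```

&lt;p&gt;
In this made-up data the totals agree even though three line items do not, which is why comparing at a fine grain matters.
&lt;/p&gt;&lt;p&gt;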
My summary is below but I&#039;d love you to send me some additions.
&lt;/p&gt;&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;Minimise manual upgrade actions&lt;/li&gt;
&lt;li&gt;Talk to your users/customers&lt;/li&gt;
&lt;li&gt;How do they actually use the product?&lt;/li&gt;
&lt;li&gt;Can you get a copy of their data to do some test runs?&lt;/li&gt;
&lt;li&gt;The customer is always right. Any deviation from your expectations is free market research&lt;/li&gt;
&lt;li&gt;Make it easy for your users/customers to do their own testing&lt;/li&gt;
&lt;/ul&gt;
        </description>
      
      
    
    
    
    <category>How do you deliver software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2010/09/29/maintainable_systems_2.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2010/09/29/maintainable_systems_2.html</guid>
    <pubDate>Wed, 29 Sep 2010 07:54:47 GMT</pubDate>
  </item>
  
  <item>
    <title>Fail Safe</title>
    <link>http://www.codingthearchitecture.com/2010/03/23/fail_safe.html</link>
    
      
        <description>
          &lt;p&gt;
One of the most misunderstood engineering terms is &#039;fail safe&#039;. Most people from a non-engineering background (including many software developers) believe it means something won&#039;t fail. Last week even the Economist used it incorrectly. 
&lt;/p&gt;
&lt;p&gt;
A &#039;fail safe&#039; device/system is &lt;b&gt;expected&lt;/b&gt; to eventually fail but when it does it will be in a safe way. Classic examples include the brakes on trains that engage when they fail and ratchet mechanisms in lifts/elevators so they can&#039;t drop if the cable breaks. Well engineered physical devices will state their Mean Time Between Failure (MTBF) and define how they can fail and what happens when they do. A well maintained physical device may never fail over its lifetime but you know what will happen if it does. 
&lt;/p&gt;
&lt;p&gt;
A fail safe physical device may also define what occurs when a user error causes it to behave in an undesired manner. For example the “dead man handles” in lawn-mowers or electric drills. I own an angle-grinder and in order to turn it on I have to flick a switch and then pull a trigger. Importantly, if I let the trigger go the cutting blade is stopped. This means that if I drop it I&#039;m much less likely to lose a foot. When the trigger is released the switch is also reset, making it impossible for the trigger to be pressed by bouncing off an object.
&lt;/p&gt;
&lt;p&gt;
As there is no physical wear-and-tear on a software system the concept of MTBF is arguably not applicable. However software systems can and do fail all the time, so perhaps it&#039;s surprising that many software systems I&#039;ve experienced don&#039;t cope with failure very well or have no defined actions for when they fail. For example the following may happen:
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;Underlying hardware failure. Networks and external disks are the ones I encounter most.&lt;/li&gt;
&lt;li&gt;External system failure. Obviously your system is perfect but external systems you rely on start to feed you garbage.&lt;/li&gt;
&lt;li&gt;User error. If you create an idiot proof system then I guarantee they will employ a better idiot.&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
It&#039;s tempting to try to correct a failure situation and keep on running but this can lead to a system getting into an unknown state and creating more issues. For example:
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;The network is not responding but you keep on processing inputs and queuing outputs hoping it comes back. Your caches and disks fill up affecting other systems. Eventually it does come back on line and your system stops responding as it processes hours worth of stale data.&lt;/li&gt;
&lt;li&gt;An external data provider starts sending blanks in a numeric field. A developer had previously decided to &#039;interpret&#039; empty as a zero (whereas it was missing data) and this fed through a bank&#039;s pricing systems and was forwarded on to other systems, which then tried to execute buys (as these were obviously a bargain at zero!)&lt;/li&gt;
&lt;li&gt;In finance we worry about &#039;fat fingers&#039; where a trader hits the wrong keys and buys 12 million rather than 1 million...&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
All of the above are real examples I have come across. How would I have changed the failure handling? I prefer to put the system into a known, safe state if possible.
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt;Put limits on anything you do for recovery situations e.g. retry only three times, put a time limit on caches etc. Don&#039;t continually do something that isn&#039;t working.&lt;/li&gt;
&lt;li&gt;Don&#039;t make generic assumptions about correcting data across a system. If it&#039;s not a good input then fail that input as you have no idea what it really means and you are hiding the error. Note that I&#039;m not suggesting the entire system should be suspended but the transactions that are in error should be suspended and reported upon.&lt;/li&gt;
&lt;li&gt;User inputs are often sanity checked but “are you sure” dialogs are automatically clicked (without reading them) or the “never show this again” checkbox is selected. Ultimately, there is only so much you can do to save the user from themselves but you might want to save an audit of the user&#039;s decisions...&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
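&lt;p&gt;
The first two points can be sketched in Python as follows. This is a minimal illustration and not a recipe: the exception type, the function names and the record layout are assumptions made for the example.
&lt;/p&gt;

```python
import time


class InvalidInput(Exception):
    """Raised when an input cannot be trusted; the caller suspends and reports it."""


def parse_price(raw):
    """Parse a numeric field strictly: blank or malformed input is an error, never zero."""
    text = (raw or "").strip()
    if not text:
        raise InvalidInput("price is missing")
    try:
        return float(text)
    except ValueError:
        raise InvalidInput(f"price is not a number: {raw!r}") from None


def call_with_retries(action, attempts=3, delay_s=0.0):
    """Try `action` a bounded number of times, then give up and re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay_s)


def process(records):
    """Price the good records; suspend the bad ones instead of guessing."""
    priced, suspended = [], []
    for record in records:
        try:
            priced.append((record["id"], parse_price(record.get("price"))))
        except InvalidInput as error:
            suspended.append((record["id"], str(error)))
    return priced, suspended


# Hypothetical feed where the provider has started sending blanks.
feed = [{"id": 1, "price": "10.5"}, {"id": 2, "price": ""}, {"id": 3, "price": None}]
priced, suspended = process(feed)
print("priced:   ", priced)
print("suspended:", suspended)
```

&lt;p&gt;
Only the bad transactions are held back, and the suspended list gives you something concrete to report on rather than a silent zero.
&lt;/p&gt;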
&lt;p&gt;
It&#039;s important to not just put the system (or transaction) into a safe state but to also inform those that can resolve the situation. As developers we often write
&lt;/p&gt;
&lt;p&gt;
LOG.warn(&#034;Transaction X has failed&#034;);
&lt;/p&gt;
&lt;p&gt;
and think nothing more about it. It&#039;s amazing to use a reporting tool like Splunk on a mature system and extract all the worrying messages. Would it be more appropriate to send an email, pager message, text message or change a dashboard status etc?
&lt;/p&gt;
&lt;p&gt;
We need to design the error reporting and monitoring services up front and define how the operators should be kept informed. We also need to allow the operators to resolve issues speedily and safely.
&lt;/p&gt;
&lt;p&gt;
To conclude:

&lt;ul&gt;
&lt;li&gt;How can a system fail?&lt;/li&gt;
&lt;li&gt;What safe state can be entered?&lt;/li&gt;
&lt;li&gt;How can the failure be reported?&lt;/li&gt;
&lt;li&gt;How can the issue be resolved?&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>What is software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2010/03/23/fail_safe.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2010/03/23/fail_safe.html</guid>
    <pubDate>Tue, 23 Mar 2010 21:13:00 GMT</pubDate>
  </item>
  
  <item>
    <title>Designing Maintainable Systems</title>
    <link>http://www.codingthearchitecture.com/2010/01/29/designing_maintainable_systems.html</link>
    
      
        <description>
          &lt;p&gt;
I&#039;m currently involved in a project to upgrade a third party piece of software and it&#039;s apparent that when the software was originally designed, the upgrade process was not considered. This became obvious when we totaled up the time required to perform, configure and post-release test the upgrade - it came to over three days of work. This was not even taking into account any rollback times (which is fortunately simplified these days by the use of virtualisation).
&lt;/p&gt;
&lt;p&gt;
The software is used heavily from Monday to Friday so we wanted to upgrade over a weekend. The vendor suggested we perform an upgrade on a parallel system and then get the users to re-enter all the data into the new system that was missed - you can imagine how well that would have gone down. This would also mean trying to post-release, regression test two systems that are live, being used and not in sync.
&lt;/p&gt;
&lt;p&gt;
Software almost always needs updating/upgrading (unless it&#039;s control software for a deep space probe!). The ability to upgrade, and the consequences of upgrading, should be considered as part of the design and development process. Questions to ask include:
&lt;/p&gt;
&lt;p&gt;
&lt;ul&gt;
&lt;li&gt; Can an upgrade be performed in parallel to a live, running system and how does a switchover occur?
&lt;li&gt; Will a system need to be taken down for any upgrades and for how long? How does this affect your Service Level Agreements?
&lt;li&gt; How easy will any upgrade be to roll back? Errors occur!
&lt;li&gt; Can you upgrade parts of the systems or does everything have to be done at once?
&lt;li&gt; What is the effect on any users? Will they need to log out first etc? Will they lose any work if they fail to follow your procedures?
&lt;li&gt; How easy will it be to test the upgraded system to determine success? Your notice of failure shouldn&#039;t be an angry user phone call.
&lt;/ul&gt;
&lt;/p&gt;
&lt;p&gt;
Some simple tools can make all the difference. Most of my work is on financial applications and I like to run regression reports between systems for important points e.g. End-of-year. However it&#039;s often very difficult to get data out of systems to perform simple comparisons!
&lt;/p&gt;
&lt;p&gt;
Sensible configuration management is often missing. If I&#039;ve upgraded and configured new features in my pre-production environment I really shouldn&#039;t have to repeat the process from scratch in production. Manual processes are prone to errors and ideally once I&#039;ve prepared for an upgrade I should just hit a &#039;go&#039; button and sit back.
&lt;/p&gt;
&lt;p&gt;
In my experience very few software developers are aware of IT Service Management (ITSM/ITIL). In particular we should be aware of the Change Management, Release Management and Configuration Management roles that support staff have. If you want to read about ITSM/ITIL then the &lt;a  href=&#034;http://en.wikipedia.org/wiki/Information_Technology_Infrastructure_Library&#034;&gt;wiki page&lt;/a&gt; is a good place to start. 
&lt;/p&gt;
&lt;p&gt;
Some of the processes of ITSM may strike agile developers as heavyweight, but this doesn&#039;t stop you developing the system in an agile manner. It just means that the system can be deployed within a formal environment.
&lt;/p&gt;
&lt;p&gt;
An architect should be aware of how the software fits into the organisation. So remember that your ‘users’ aren&#039;t just the end users but also the support staff who&#039;ll be maintaining your system for the next ten years!
&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>What is software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2010/01/29/designing_maintainable_systems.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2010/01/29/designing_maintainable_systems.html</guid>
    <pubDate>Fri, 29 Jan 2010 13:04:00 GMT</pubDate>
  </item>
  
  <item>
    <title>Modifying Open Services</title>
    <link>http://www.codingthearchitecture.com/2008/11/14/modifying_open_services.html</link>
    
      
        <description>
          &lt;p&gt;
There&#039;s been a huge push recently towards service-oriented architectures - sharing services within an organisation with benefits such as reuse and making information consistent. Take a simple example such as a catalogue of products for a furniture company. As a shared and open service, all of the company&#039;s systems - Sales, Marketing, Support, Delivery and Billing applications - can use this information in an open and consistent way.

&lt;p&gt;
If a service is very open and easy to use (e.g. services operating via a RESTful interface) then there is a good chance that it will be used in ways you didn&#039;t originally intend, probably by applications you don&#039;t even know about. This sounds great but you&#039;ll soon find that you&#039;ve lost the ability to audit current use and gauge the effect of any change. As an example, let&#039;s say we want to add details for &#039;forest sustainability&#039; to our furniture information. We add a block of XML to describe this and release. However, an application that uses our service starts generating errors as it&#039;s not expecting this new information. (We could argue that it &lt;i&gt;shouldn&#039;t&lt;/i&gt; do this but this is what happens in the real world.) Problems are even more likely if you have to modify, rather than add to, your format. Changing an integer to a floating-point number could cause strange issues.

&lt;p&gt;
You need to be able to get dependent applications to test with your new service before you release, but who&#039;s using it and how are they using it? You can log the incoming requests to see what is being used but this doesn&#039;t tell you who is using it - so how do you know who has to test changes?

&lt;p&gt;
This is a problem I&#039;ve been seeing recently and a solution is to use authentication even if you have no intention of restricting access. You can make the credentials easy to obtain but you need to make sure the users of your service are registered and provide sufficient contact information. Of course, actually getting the users to test and adapt to changes is another issue but at least they can&#039;t complain they weren&#039;t informed.
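
&lt;p&gt;
To make the idea concrete, here is a minimal Python sketch, assuming each request is logged with the API key that authenticated it. The registry contents, the log entries, the e-mail addresses and both helper functions are hypothetical; a real service would hold this data in its authentication system and access logs.
&lt;pre&gt;
```python
from collections import defaultdict

# Hypothetical registry: API key mapped to the owning application and a contact.
REGISTERED_CONSUMERS = {
    "key-sales": {"application": "Sales", "contact": "sales-dev@example.com"},
    "key-billing": {"application": "Billing", "contact": "billing-dev@example.com"},
}

# Hypothetical access log entries: (api_key, resource) per incoming request.
ACCESS_LOG = [
    ("key-sales", "/products"),
    ("key-billing", "/products"),
    ("key-sales", "/products/prices"),
    ("key-unknown", "/products"),
]


def consumers_by_resource(access_log, registry):
    """Group the contacts of registered consumers by the resource they call."""
    usage = defaultdict(set)
    for api_key, resource in access_log:
        consumer = registry.get(api_key)
        if consumer is None:
            # Unregistered caller: with authentication enforced this request
            # would have been rejected, so there is nobody to notify.
            continue
        usage[resource].add(consumer["contact"])
    return {resource: sorted(contacts) for resource, contacts in usage.items()}


def contacts_to_notify(resource, access_log, registry):
    """Return who should be asked to test a change to the given resource."""
    return consumers_by_resource(access_log, registry).get(resource, [])


print(contacts_to_notify("/products", ACCESS_LOG, REGISTERED_CONSUMERS))
```
&lt;/pre&gt;
&lt;p&gt;
The point is only that once every request carries an identity, the question &#039;who has to test this change?&#039; becomes a simple lookup rather than guesswork.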

&lt;p&gt;
Has anyone else seen this issue and what were your solutions? Did you just &#039;publish and be damned&#039; or end up introducing a heavyweight process to control releases?

        </description>
      
      
    
    
    
    <category>How do you define software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2008/11/14/modifying_open_services.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2008/11/14/modifying_open_services.html</guid>
    <pubDate>Fri, 14 Nov 2008 19:53:38 GMT</pubDate>
  </item>
  
  <item>
    <title>Project Naming</title>
    <link>http://www.codingthearchitecture.com/2008/03/01/project_naming.html</link>
    
      
        <description>
          &lt;p&gt;
Whenever I&#039;ve started work on a new project I&#039;ve had an introduction along these lines:

&lt;/p&gt;
&lt;blockquote&gt;Wizard feeds into Pluto, which then re-values. It broadcasts changes that are picked up by Puma and recorded by Halo.
&lt;/blockquote&gt;
&lt;p&gt;
but what they really mean is something like:
&lt;/p&gt;

&lt;p&gt;
&lt;blockquote&gt;
The Market-data System feeds into the Trade Calculation Engine, which then re-values. It broadcasts changes that are picked up by the Trade Display GUI and recorded by the Audit system.
&lt;/blockquote&gt;
&lt;/p&gt;
&lt;p&gt;
The IT industry seems to have a peculiar habit of giving projects names that have no relation to what they actually do. Perhaps we feel the need to make a dull system sound more interesting (&#034;We can make the accounts receivable project exciting by calling it Skydive&#034;) and this is fine if you have only one system. I deal with architectures that have dozens of subsystems, and there it can be a real problem: it adds to the learning curve and confuses newcomers.
&lt;/p&gt;
&lt;p&gt;
There is an argument that a system&#039;s name shouldn&#039;t have meaning as the scope of the system will probably grow and change throughout its life. If a project has a very specific name such as &#034;stock-option pricer&#034; it will be inaccurate once you add the ability to price bond-options. A few years down the line it could actually be much more confusing than calling it Pluto.
&lt;/p&gt;
&lt;p&gt;
It would be nice if you could change the name of a project/system but this can be very hard. Not because you can&#039;t refactor code and change configuration but because of funding, politics and the users&#039; familiarity with the existing name.
&lt;/p&gt;
&lt;p&gt;
What is your preferred technique for naming projects and systems? Something funky, something meaningful or a compromise? If you are attending the user group meeting this Tuesday you can share your craziest project name with me!
&lt;/p&gt;
&lt;p&gt; Update - It&#039;s been suggested to me that in a SOA system you should name a service very precisely, and when the name no longer matches it&#039;s probably time to introduce a new service. Neat!
&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>How do you share software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2008/03/01/project_naming.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2008/03/01/project_naming.html</guid>
    <pubDate>Sat, 01 Mar 2008 21:06:34 GMT</pubDate>
  </item>
  
  <item>
    <title>System Design and Reconciliation</title>
    <link>http://www.codingthearchitecture.com/2008/02/08/system_design_and_reconciliation.html</link>
    
      
        <description>
          &lt;p&gt;
Even those not working in Finance have probably heard about the &lt;a href=&#034;http://news.bbc.co.uk/1/hi/business/7208439.stm&#034;&gt;intriguing events&lt;/a&gt; in a large investment bank and the HUGE losses that have occurred.
&lt;/p&gt;&lt;p&gt;
The interesting comment from an IT and architecture point of view is the following:
&lt;/p&gt;
&lt;blockquote&gt;The trader responsible for the fraud had &#034;in-depth knowledge of the control procedures resulting from this former employment in the middle-office&#034;, the bank said.
&lt;/blockquote&gt;
&lt;p&gt;
I&#039;m sure the story and our perception of it will change as more facts emerge but we can be certain that the bank&#039;s internal security and control systems failed.
&lt;/p&gt;&lt;p&gt;
Many security and control terms such as authentication, authorization and auditing are probably familiar, but what about reconciliation? Consider a simple stock-taking example. The number of items delivered to a warehouse minus the number of items sold should tell you how many are still there. If the actual count is less, you may have a thief. (If you&#039;ve ever been involved in a stock-take you may have seen many missing items - they tend to be small and valuable.) Stock-taking is a physical example of reconciliation, i.e. do all the numbers add up?
&lt;/p&gt;&lt;p&gt;
Should you be doing this in a system of your own? If you have developed an online store, do you periodically make sure that the number of orders to your website equals the number of payments received and items sent? It&#039;s very common for a system designer to assume this will be the case, as the data normally flows from one component to the next, but consider what happens if someone sends messages to your delivery system directly. They bypass your ordering and payment systems so there is no matching order or payment. This is likely to be an insider who knows the system well and you can lose a lot of money as you send out goods without being paid. These kinds of issues are increasingly likely as designers move away from monolithic systems to SOAs.
&lt;/p&gt;&lt;p&gt;
Apart from internal checks between your own systems you can reconcile with external ones. In securities trading, we usually get a list of trades from the markets and exchanges at the end of the day and reconcile against the bank&#039;s internal list of trades. This should show if a trader has been placing trades they shouldn&#039;t have (or failing to place those they should).
&lt;/p&gt;&lt;p&gt;
It&#039;s important to make sure that the main system and the reconciliation system are separate, otherwise an error or fraud in one could affect the other. (Reconciliation is often simple and can be done in something like a script.)
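&lt;/p&gt;&lt;p&gt;
As an illustration of how small such a script can be, here is a minimal Python sketch for the online store example. It assumes each of the ordering, payment and delivery systems can export the order IDs it handled for a period; the IDs and the &#039;reconcile&#039; function are hypothetical.
&lt;/p&gt;
&lt;pre&gt;
```python
def reconcile(orders, payments, deliveries):
    """Compare three independently sourced collections of order IDs.

    Returns a dict naming each kind of break and the order IDs involved.
    """
    orders, payments, deliveries = set(orders), set(payments), set(deliveries)
    return {
        "delivered_without_order": sorted(deliveries - orders),
        "delivered_without_payment": sorted(deliveries - payments),
        "paid_but_not_delivered": sorted(payments - deliveries),
        "ordered_but_not_paid": sorted(orders - payments),
    }


# Hypothetical extracts, one from each system for the same period.
order_ids = ["A100", "A101", "A102"]
payment_ids = ["A100", "A101"]
delivery_ids = ["A100", "A101", "A999"]  # A999 was sent straight to delivery

for break_type, ids in reconcile(order_ids, payment_ids, delivery_ids).items():
    if ids:
        print(break_type, ids)
```
&lt;/pre&gt;
&lt;p&gt;
Each extract is read from its own system rather than from a shared table, in keeping with the separation described above. A delivery with no matching order or payment is exactly the bypass scenario and shows up as a break.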
&lt;/p&gt;&lt;p&gt;
Finally, you should remember that regression testing is actually a form of reconciliation between different versions of systems, so you may be doing some already!
&lt;/p&gt;
        </description>
      
      
    
    
    
    <category>How do you define software architecture?</category>
    
    <comments>http://www.codingthearchitecture.com/2008/02/08/system_design_and_reconciliation.html#comments</comments>
    <guid isPermaLink="true">http://www.codingthearchitecture.com/2008/02/08/system_design_and_reconciliation.html</guid>
    <pubDate>Fri, 08 Feb 2008 22:06:20 GMT</pubDate>
  </item>
  
  </channel>
</rss>

