Welcome!

My name is Naums, and this is my blog about the project called “The Sensor Organism”. The project is organized and sponsored by the YCCSA Summer School programme at the University of York. TL;DR – my project is about modelling biological processes using Arduino and XBee; the goal is to either benefit from the overwhelming effectiveness of some biological systems, or to get a biologically accurate simulator which can help answer questions about biology. For more details on the project download the final presentation (PDF), poster (PDF) or poster abstract (PDF), and read the posts “1. About project”, “12. Model: overview” and “21. Conclusions”.

There are a few reasons for this blog to exist:

  • It serves as a documentation source for those working on similar subjects.
  • It helps to promote the YCCSA Summer School programme and serves as an example of the project for the prospective participants.
  • It is my personal logbook that helps me retrace my steps.

The project is supervised by the University of York’s Dr Martin A. Trefzer (profile) of the Department of Electronics and Dr Dimitris Lagos (profile) of the Department of Biology, HYMS and CII. I am very thankful for the scholarship to the York Centre for Complex Systems Analysis (YCCSA) (homepage).
You are welcome to read my short bio; find me on Facebook and LinkedIn. A few examples of my work are available on GitHub. Do contact me in comments or via email (naums.mogers@gmail.com) if you have any questions or suggestions!


21. Conclusions

The project

The hard outputs of this project are:

  • Definition of the Sensor Organism model – a multi-cellular Artificial Developmental System based on Gene Regulatory Networks (model overview).
  • A C++ OOP codebase for the implementation of the model – groundwork for the prototype.
  • Presentation, poster and poster abstract.
  • This blog – 20+ posts on my progress, literature review, model description and Arduino development.

Further work and suggested model improvements are outlined in the project presentation; I plan to continue working on the code for some time until a proof-of-concept prototype is delivered.

I am very confident in saying that it was a highly rewarding project and during this summer I have learned quite a bit:

  • Biology: cellular systems, cell communication, cell differentiation, epigenetic and gene transcription mechanisms.
  • Bioinformatics: Artificial Developmental Systems, Gene Regulatory Networks, Artificial Epigenetic Networks and Cartesian Genetic Programming.
  • Computer Science: C++.
  • Arduino: IDEs, libraries and ZigBee radio programming.
  • Academic research: we had some great meetings with Martin and Dimitris where I learned about research at the intersection of computer science and biology, as well as research in general. I really appreciate that my supervisors made sure to draw my attention to the bigger picture of my project – what question it answers, how it fits within the field and what I should take from it in the allocated 9 weeks. To enrich my experience, they often dwelt on the nature of interdisciplinary research: how researchers from different fields come with different mindsets, lexicons and approaches to problems.

YCCSA Summer School

Overall, I couldn’t wish for a more rewarding summer. I was given the opportunity, and help, to choose both a problem and an approach to solving it, which was a chance to explore my interests and learn something new. One of the reasons I originally applied for this programme was to see whether an academic environment is where I can see myself in the future – and this project turned out to be a perfect opportunity to find this out. It has certainly made academic research – and bioinformatics – very appealing to me, which will affect my future career choices. For this I have to thank my supervisors Martin Trefzer and Dimitris Lagos, and everybody at YCCSA who made the Summer School happen. Thank you all!

19. Development tools

While I was working on the prototype I ran across several good tools for Arduino development that replace the standard Arduino IDE.

Platformio

http://platformio.org/
Command-line code builder for embedded systems including Arduino. Written in Python, it is cross-platform and very easy to install. Initializing a new project is a matter of one command:

platformio init --board=leonardo

Then the code can be compiled (and uploaded to the Arduino, if auto-upload is switched ON) using just the run command:

platformio run

Or uploaded manually:

platformio run --target upload

Syntax-highlighted output reports all compilation errors/warnings, as well as Arduino memory usage. Here is a screenshot of two example Platformio outputs (click the image to enlarge):

Platformio screenshot

In the first run Platformio reported no errors and (excessive) Arduino memory usage; before the second run I randomly inserted a symbol in the code, so Platformio threw an error. Memory usage is a very useful metric that allows you to check if the code fits in your hardware without actually uploading it.

Overall, using Platformio you can ditch the uncomfortable Arduino IDE and use the IDE of your choice.

Cloud9 + Platformio

https://c9.io/
Cloud9 is a great cloud IDE for development in a range of different languages. For each of your projects you get a separate Ubuntu VM that you control purely via the browser: the IDE seems to be implemented in HTML5 or something equally light-weight that requires very little traffic. The IDE provides a good editor with all the standard features and a console, which you can use to install any packages you need – for example, Platformio.

Here is a screenshot of my Cloud9 workspace (click the image to enlarge):

Cloud9 & Platformio screenshot

Cloud9 has paid plans, but the free one also allows you to create one or two workspaces. Here are the free features I liked the most:

  • Nice editor themes, syntax highlighting and auto-completion.
  • Cross-platform – crazy enough, it works fine even on iPad’s Chrome. Even in the absence of hotkeys, it’s perfectly usable for some draft work away from PC.
  • Simultaneous editing from multiple sessions (different tabs, OSs or machines) just like in Google Documents. I haven’t tried it using multiple accounts so am not sure if a free plan allows it, but even when using one account, it’s the best collaborative online code editing tool I’ve seen.

In terms of hardware, on the free plan I got:

  • Quad Core Intel Xeon CPU 2.3 GHz (8 threads)
  • 512 MB RAM
  • 1 GB disk space

Great processing power and just enough memory space to work on a small to medium project.

One thing I missed, which is supported only in a paid plan, is the ability to connect to the VM remotely through SSH – I failed to set up the University’s VPN on the VM and without SSH couldn’t set up a routine on my office PC to commit code from the VM to the internally hosted git.

While I used Platformio on Cloud9 to only compile the code, it should be possible to route its output to a local machine with attached Arduinos to program them from VM.

Overall, Cloud9 allowed me to code outside the office: at home and even on the train with a very poor Internet connection. On exit the session is paused, so next time I resumed right from where I stopped.

Codebender

https://codebender.cc/
Cloud Arduino IDE, which installs a Chrome Extension that allows you to upload your code to the physical Arduino. I haven’t actually used it as registration can only be done when an Arduino is plugged in – and I was on the move – but in case it is difficult to program a physical Arduino from Cloud9, Codebender might be a good alternative.

18. Self-organization

Among other things, our Sensor Organism model is a multi-agent system. Such systems often possess a very useful property of self-organization (SO), and in this post I’d like to explain why our model has this property as well.

Definition

Works by Bonabeau et al. [1] and Garnier et al. [2] define self-organization as follows: self-organization is a set of rules governing the functioning of the lower-level constituents of a system whereby global patterns appear without components explicitly referencing the global context. In other words, the system as a whole possesses useful properties that are not expressed by any single object.

Components and their instances in our model

There are four components basic to self-organization [1] [2]:

  1. Positive feedback: recruitment and/or reinforcement, i.e. means of promoting optimal solutions. Our model does not employ recruitment, but it does rely on reinforcement through cell energy, which is awarded to a cell for good recognitions.
  2. Negative feedback: saturation, exhaustion and/or competition, i.e. mechanisms that balance out positive feedback. Saturation is implemented via the configurable energy buffer size; competition is explicitly implemented in allowing the best cell to contribute to the gene pool.
  3. Fluctuation amplification: introducing random phenomena helps avoid stagnation at local minima and discover new solutions. There are several examples of fluctuation amplification in our model:
    1. During the differentiation stage, stem cells begin traversing genes from a random position in the genetic sequence.
    2. When optimizing the GRN, functional cells change gene values by random values.
  4. Multiple interactions as means of fault tolerance, bad solution filtering and convergence on stable solutions. This is achieved by having multiple cells processing the same signal.

Advantages of self-organization

While we are at it, let’s think about what self-organization gives to our solution. As described by Bonabeau et al. [1] and Garnier et al. [2], a self-organized system can possess the following properties:

  • Multiple stable states: the system can converge on multiple different solutions depending on its environment.
  • Bifurcations: the system can switch between optimal solutions during runtime as the environment changes.
  • Evolution: as mentioned in the definition, during its development the self-organized system acquires new useful properties purely from local interactions between its agents.

Conclusion

Self-organization is what endows our system with the ability to create good solutions without any complex mathematics; this post is an attempt to discuss this trait in a more or less formal way.

References

Some of this material is based on the “Self-organization” paragraph of my Bachelor’s dissertation. Partial report, source code and demo of my project are available on GitHub.

[1] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm intelligence: from natural to artificial systems. 1999.
[2] S. Garnier, J. Gautrais, and G. Theraulaz, “The biological principles of swarm intelligence,” Swarm Intell., vol. 1, no. 1, pp. 3–31, 2007.

17. Model: genetic encoding

All GRN-related information is stored in two chromosomes, which are multi-linked lists. This encoding is inspired by Cartesian Genetic Programming (read more in one of my previous posts as well as on cartesiangp.co.uk).

  1. Chromosome 1 (Ch1) contains ordered genes of tuple type:
    • (Bucket number, Next bucket occurrence, Next gene of the pattern)
    • Parameters 2 and 3 are links to other Ch1 genes.
  2. Chromosome 2 (Ch2) maps Ch1; each Ch2 gene represents a single pattern (function) in the sensor signal that is encoded in a sequence of Ch1 genes.
    • (Sensor ID, First Ch1 gene, Last Ch1 gene, Next sub-pattern, Next pattern)
    • Parameters 2 and 3 are links to Ch1 genes, parameters 4 and 5 are links to Ch2 genes.

The colour coding above is as follows: blue means a pointer to the next member of the sequence, green means a pointer to the start of the next sequence, red means an external pointer.

Each separate sequence of Ch1 genes starts with an ordered list of genes with unique buckets that are present in this sequence (i.e. pattern map) – (B1), (B3), (B4), (B5) – which are followed by other instances of these buckets in the order of occurrence in the pattern – (B1), (B3), (B4), (B5), (B3), (B1), (B4), (B3), …
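The two gene types can be sketched as plain C++ structs. This is a minimal sketch rather than the project's actual code: the field names are hypothetical, and the widths (a 1-byte bucket number or sensor ID, 2-byte links stored as gene indices) are assumptions chosen so that the genes come to the 5 and 9 bytes quoted in the Optimization section.

```cpp
#include <cstdint>

// Hypothetical layout: names and field widths are assumptions for illustration.
// Links are gene indices within a chromosome, not raw pointers.
#pragma pack(push, 1)  // no padding, so sizeof equals the sum of the fields

struct Ch1Gene {
    std::uint8_t  bucket;                // Bucket number
    std::uint16_t nextBucketOccurrence;  // link to another Ch1 gene
    std::uint16_t nextGeneOfPattern;     // link to another Ch1 gene
};

struct Ch2Gene {
    std::uint8_t  sensorId;        // Sensor ID
    std::uint16_t firstCh1Gene;    // link into Ch1
    std::uint16_t lastCh1Gene;     // link into Ch1
    std::uint16_t nextSubPattern;  // link to another Ch2 gene
    std::uint16_t nextPattern;     // link to another Ch2 gene
};

#pragma pack(pop)

// Compile-time check that the layout matches the quoted gene sizes.
static_assert(sizeof(Ch1Gene) == 5, "unoptimized Ch1 gene should be 5 bytes");
static_assert(sizeof(Ch2Gene) == 9, "unoptimized Ch2 gene should be 9 bytes");
```

Storing indices instead of pointers keeps the gene size independent of the platform's pointer width, which matters when the same sequences travel between Arduinos.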

The image below is an example of pattern encoding. On the chart we have 3 patterns:

  • The first pattern consists of two sub-patterns (sinusoid and a constant)
  • The third pattern is the same as the first sub-pattern of the first pattern, namely a sinusoid.

Genetic encoding

On the chart red diamonds represent Ch1 genes.

As you can see, Ch2 is a direct translation of the facts above: there are four Ch2 genes, one for each sub-pattern; the next sub-pattern of Gene 0 is Gene 1; Genes 2 and 3 don’t have sub-patterns other than themselves; and Gene 3 points to the same Ch1 sequence as Gene 1.

Optimization

The abundance of pointers is necessary to be able to traverse the GRN in the way described in my previous post. However, we can get rid of some of them without having to sacrifice much computational power on traversal.

The unoptimized version is quite lengthy: each Ch1 gene is 5 Bytes long, each Ch2 gene is 9 Bytes long. Here is how I managed to optimize the scheme:

  1. Chromosome 1:
    • Genes of each pattern are grouped by buckets: [All genes with B0], [All genes with B1], [All genes with B2], …
    • The first gene of the bucket group is of the following type:
      • (Bucket number, Next bucket group, Next gene of the pattern) [5 Bytes]
    • Other genes in the bucket group are of the following type:
      • (Bucket number, Next gene of the pattern) [3 Bytes]
  2. Chromosome 2:
    • We don’t actually need the pointer to the last Ch1 gene in the sequence, so -2 Bytes.
    • The genes are grouped by an affiliation to sensor. The first gene in the sensor group is of the following type:
      • (Sensor ID, First Ch1 gene, Next sensor group, Next pattern) [7 Bytes]
    • The genes (representing sub-patterns) are also grouped by patterns, i.e. all sub-patterns are stored in a continuous array. The first gene in the pattern sub-group is of the following type:
      • (First Ch1 gene, Next pattern) [4 Bytes]
    • Other genes in the pattern sub-group are of the following type:
      • (First Ch1 gene) [2 Bytes]

Thanks to this optimization the size of Ch1 is reduced by 0-40% and the size of Ch2 is reduced by 22-78% – the savings are bigger for larger and more complex patterns.
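The arithmetic behind these percentages can be sketched in C++ using the per-gene sizes listed above. The function names are hypothetical, and the Ch2 formula assumes that the first gene of every sensor group is also the first gene of its first pattern sub-group, which the lists above do not state explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Chromosome 1 before optimization: every gene is 5 bytes.
std::size_t ch1PlainSize(std::size_t genes) { return genes * 5; }

// Chromosome 1 after optimization: one 5-byte gene per bucket group,
// 3 bytes for every other gene.
std::size_t ch1OptimizedSize(std::size_t genes, std::size_t bucketGroups) {
    assert(bucketGroups <= genes);
    return bucketGroups * 5 + (genes - bucketGroups) * 3;
}

// Chromosome 2 before optimization: every gene is 9 bytes.
std::size_t ch2PlainSize(std::size_t genes) { return genes * 9; }

// Chromosome 2 after optimization: 7 bytes for the first gene of a sensor
// group, 4 bytes for the first gene of any other pattern sub-group, 2 bytes
// for the rest. Assumes sensor-group heads double as pattern heads.
std::size_t ch2OptimizedSize(std::size_t genes, std::size_t sensorGroups,
                             std::size_t patterns) {
    assert(sensorGroups <= patterns && patterns <= genes);
    return sensorGroups * 7 + (patterns - sensorGroups) * 4 +
           (genes - patterns) * 2;
}
```

With made-up counts, 10 Ch1 genes in 4 bucket groups shrink from 50 to 38 bytes (24%), and 4 Ch2 genes covering 3 patterns of a single sensor shrink from 36 to 17 bytes (about 53%). The extremes of the quoted ranges follow from the same formulas: a chromosome made only of group heads saves 0% (Ch1) or 22% (Ch2), while one dominated by the shortest gene type approaches 40% (Ch1) or 78% (Ch2).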

Here is the same example from above with an optimized encoding:
Optimized genetic encoding

 

16. Model: GRN

For a general description of GRNs, read my post “Gene Regulatory Networks”. This post is about the specific variation of GRN that we use in this project.

Within this model I put the GRN on the genetic level so as not to overcomplicate things. To be more precise, however, the GRN is actually an epigenetic mechanism, as it is an algorithm for traversing the gene sequence.

Gene Regulatory Network

Our GRN consists of the following components:

  • Each gene has a bucket (expected value) associated with it; genes are activated by sensor readings which fall into the gene buckets. Genes output cell energy, which is directly proportional to the difference between the expected and actual inputs.
  • Gene outputs are channelled into the energy-producing output node and into the gates of the subsequent genes.
  • Gene gates filter inputs. Initially all gates are open; later gates are opened by the preceding genes.

Algorithms of the components are defined a bit more formally on the image above.

Notice the looping of the gate opening signal channels – it represents the fact that this GRN encodes periodic functions.

Here is a small worked example to demonstrate how the GRN operates. This GRN encodes a sinusoid function.

(Images above enlarge on click, the steps below are copied in the image descriptions)

  1. Input the first sensor reading, say Bucket 2. As initially all gene gates are open, this input is channelled into all genes. In other words, we don’t know the current phase of the signal function, and any one of multiple genes may be associated with the current reading.
  2. Check which genes were activated by the reading (here Gene 1 and Gene 5) and produce outputs from them, namely cell energy and gate opening signals. Gene 1 opens the gate of Gene 2, and Gene 5 opens the gate of Gene 6. In other words, we have reduced the number of possible function phases to 2 and we will accept both Bucket 3 (in Gene 2) and Bucket 1 (in Gene 6).
  3. Next iteration: input the next sensor reading, say Bucket 1. Bucket 1 can be channelled into Gene 6 and Gene 8, but the gate of Gene 8 is closed, so the input is channelled only into Gene 6.
  4. Produce the outputs of the only activated gene, namely cell energy and gate opening signal. After this step only one gate is open at any time, which means that we have determined the phase of the function – we will only accept one input value.
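The four steps above can be replayed in a few lines of C++. This is a minimal sketch under stated assumptions, not the project's implementation: the class name `GatedGrn` is hypothetical, the full bucket sequence 2, 3, 4, 3, 2, 1, 0, 1 is chosen only so that it agrees with the steps, gene indices are zero-based (Gene 1 is index 0), and reopening all gates when no gene fires is a guess at behaviour the post does not describe. Cell energy is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical gated GRN for one periodic pattern: gene i expects buckets[i]
// and, when activated, opens the gate of gene (i + 1) % n.
class GatedGrn {
public:
    explicit GatedGrn(std::vector<int> buckets)
        : buckets_(std::move(buckets)), open_(buckets_.size(), true) {
        assert(!buckets_.empty());
    }

    // Feeds one reading (a bucket number) into the network and returns the
    // indices of the genes it activated.
    std::vector<std::size_t> step(int bucket) {
        std::vector<std::size_t> activated;
        for (std::size_t i = 0; i < buckets_.size(); ++i) {
            if (open_[i] && buckets_[i] == bucket) activated.push_back(i);
        }
        if (activated.empty()) {
            // Assumption: lose track of the phase and start over.
            open_.assign(open_.size(), true);
            return activated;
        }
        // Close every gate, then let each activated gene open its successor.
        open_.assign(open_.size(), false);
        for (std::size_t i : activated) {
            open_[(i + 1) % buckets_.size()] = true;
        }
        return activated;
    }

    // Number of gates currently open, i.e. the phases still considered possible.
    std::size_t openGates() const {
        std::size_t n = 0;
        for (bool gate : open_) {
            if (gate) ++n;
        }
        return n;
    }

private:
    std::vector<int> buckets_;
    std::vector<bool> open_;
};
```

Constructing `GatedGrn` with the sequence above and calling `step(2)` activates indices 0 and 4 (Gene 1 and Gene 5) and leaves two gates open; a following `step(1)` activates only index 5 (Gene 6) and leaves a single gate open, which is the point where the phase is determined.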

15. Model: cellular level

Stem cell

Stem cells fill the colony on initialization and take the place of functional cells once the latter die. Their only task is to differentiate, namely:

  • Choose a sensor. On initialization a stem cell compares the complexity level of each sensor (read more about complexity levels in my previous post) with the number of functional cells that are differentiated to that sensor. Based on that, it chooses the sensor that lacks the most cells.
  • Choose a pattern. Once a sensor is chosen, the cell checks the patterns stored in the DNA bank that are associated with the sensor. If matches are found, the cell picks a random one and constructs the respective Gene Regulatory Network (GRN). If there are no matches, the stem cell creates a new GRN.
    • Read more about GRN in general in my post about GRNs, and about the variant of GRN used in this project in the next post.

Once the stem cell is differentiated, it becomes a functional cell.
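The sensor choice can be sketched as a search for the largest shortfall of cells. The function name `chooseSensor` is hypothetical, and the sketch assumes that each sensor's complexity level has already been translated into a wanted number of cells; how that translation is done is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the sensor that lacks the most cells, or -1 if every
// sensor already has as many functional cells as its complexity calls for.
// Assumption: wanted[i] is the cell count implied by sensor i's complexity.
int chooseSensor(const std::vector<int>& wanted,
                 const std::vector<int>& assigned) {
    assert(wanted.size() == assigned.size());
    int best = -1;
    int bestDeficit = 0;
    for (std::size_t i = 0; i < wanted.size(); ++i) {
        int deficit = wanted[i] - assigned[i];
        if (deficit > bestDeficit) {
            bestDeficit = deficit;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

For example, with wanted counts {3, 5, 2} and assigned counts {3, 2, 2} the function returns 1, because the second sensor is three cells short while the others are fully served.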

Functional cell

Functional cell

A functional cell contains two components:

  1. GRN, which is trained on a particular pattern, takes sensor reading as input and produces cell energy as output. Confident recognitions produce more energy.
  2. Energy buffer stores the cell energy. The energy dissipates over time; without the energy the cell dies.
    • The energy dissipation rate parameter allows us to tune the model to the problem domain. If we expect anomalies to be fine-grained, we can make cell energy dissipate fast – then cells will die quicker and report anomalies on smaller changes. In that case noise in the data might trigger the alert; however, if we make the dissipation rate too small, we may miss actual anomalies. Balance is essential!

In real life cells survive by acquiring resources from the environment and converting them efficiently into energy. Here we use the same metaphor: instead of proper metabolism, our virtual cells perform a function that is useful to us and, cruelly enough, they have to do our job well enough to survive. Cells running out of energy means that the data changed in such a way that the cell can no longer perform its function, which might be an indicator of an anomaly.
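The energy buffer described above can be sketched as a small C++ class. The class and method names are hypothetical and integer energy units are an assumption; the cap at the buffer capacity follows the configurable buffer size mentioned in the post on self-organization.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical energy buffer of a functional cell, in integer energy units.
class EnergyBuffer {
public:
    EnergyBuffer(int capacity, int initial, int dissipationPerTick)
        : capacity_(capacity),
          energy_(std::min(initial, capacity)),
          rate_(dissipationPerTick) {
        assert(capacity > 0 && initial >= 0 && dissipationPerTick >= 0);
    }

    // Energy produced by the GRN for a recognition; saturates at capacity.
    void reward(int amount) { energy_ = std::min(capacity_, energy_ + amount); }

    // One time step: energy dissipates at the configured rate.
    void tick() { energy_ = std::max(0, energy_ - rate_); }

    // A cell with no energy left is dead.
    bool alive() const { return energy_ > 0; }

    int energy() const { return energy_; }

private:
    int capacity_;
    int energy_;
    int rate_;
};
```

A larger `dissipationPerTick` makes the cell die after fewer unrewarded ticks, which corresponds to the fine-grained setting described above; a larger capacity lets a well-fed cell ride out longer stretches of unrecognized data.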

14. Model: organ level

There are two types of organs: DNA server and Cell colony.

DNA server

DNA server

A simple auxiliary organ for storing genetic sequences, the DNA server solves the problem of the Arduino’s constrained memory (2.5 KB of RAM). Like a regular server, it waits for requests, which can be of one of two types:

  • READ request specifies the chromosome and the starting gene number; the server responds by sending the requested gene sequence. Both chromosomes are structured as multi-linked lists and encode periodic functions (patterns), thus the server can find the end of a sequence by detecting looping.
  • WRITE request contains the chromosome, starting gene number and an updated gene sequence. This is a command to overwrite the respective gene sequence with updated (improved) values.
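Serving a READ request by loop detection can be sketched as follows. This is a simplified illustration: `StoredGene` and `readSequence` are hypothetical names, each gene carries a single `next` link instead of the full multi-linked structure, and a per-request `seen` table is used for clarity even though it would be costly on a real Arduino.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stored gene: one payload value and one link to the next gene.
struct StoredGene {
    int bucket;
    std::size_t next;  // index of the next gene in the same chromosome
};

// Follows the links from `start` and returns the indices of the genes in the
// sequence. The sequence ends when a link leads back to a gene already seen.
std::vector<std::size_t> readSequence(const std::vector<StoredGene>& chromosome,
                                      std::size_t start) {
    assert(start < chromosome.size());
    std::vector<bool> seen(chromosome.size(), false);
    std::vector<std::size_t> sequence;
    std::size_t i = start;
    while (!seen[i]) {
        seen[i] = true;
        sequence.push_back(i);
        i = chromosome[i].next;
        assert(i < chromosome.size());  // links must stay inside the chromosome
    }
    return sequence;
}
```

Because the patterns are periodic, the last gene of a sequence links back to its first one, so no explicit length or end marker has to be stored or sent with the request.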

Cell colony

Cell colony

Structure

Cell colony is an Arduino that is responsible for collecting and preprocessing sensor data, as well as learning and recognizing patterns. It consists of the following elements:

  1. Sensor driver is a thread that collects, normalizes and smooths the sensor readings. Finally, it releases them into the intercellular space (buffer) in the form of signalling molecules.
  2. Signalling molecule is an object that holds a single reading and the source sensor ID. Once a cell consumes a signalling molecule, the molecule is removed from the intercellular space. Each reading can be duplicated in multiple signalling molecules; the number of duplicates depends on the complexity of patterns in the sensor signal. The complexity is estimated by functional cells.
  3. Functional cell is the main computational entity of the system. During its lifetime it collects signalling molecules of one sensor and tries to recognize a single pattern in the reading sequence. If a cell fails to recognize the pattern, it eventually dies. Each functional cell constantly optimizes its underlying recognition mechanism, i.e. creates a candidate solution; the most efficient cell gets to contribute the genes of its solution to the DNA server. Additionally, the winner gets to set the respective sensor driver’s signal complexity parameter.
  4. Stem cells are what the Arduino is initially filled with, and what takes the place of a functional cell once the latter dies. A stem cell’s responsibility is to choose a sensor (based on signal complexity and the number of functional cells already working with the sensor) and to either choose an existing pattern from the DNA server that closely matches the signal or initiate new pattern learning. Choosing the sensor and the pattern is what I call cell differentiation; once a stem cell is differentiated, it becomes a functional cell.
  5. DNA cache is the buffer which cells use to store the genetic sequences during their lifetime.
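The interplay between the sensor driver, signalling molecules and cells can be sketched as a small buffer class. All names here are hypothetical, and handing out the oldest matching molecule first is an assumption; the description above only says that a consumed molecule is removed from the intercellular space.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reading together with the ID of the sensor it came from.
struct SignallingMolecule {
    int sensorId;
    int reading;
};

// Hypothetical intercellular space: the buffer between sensor drivers and cells.
class IntercellularSpace {
public:
    // Called by a sensor driver: releases `copies` duplicates of one reading.
    // The number of copies would follow the sensor's signal complexity.
    void release(int sensorId, int reading, std::size_t copies) {
        assert(copies > 0);
        for (std::size_t i = 0; i < copies; ++i) {
            SignallingMolecule molecule = {sensorId, reading};
            molecules_.push_back(molecule);
        }
    }

    // Called by a cell: takes the oldest molecule of the given sensor, stores
    // its reading in `reading` and removes it. Returns false if none is left.
    bool consume(int sensorId, int& reading) {
        for (std::size_t i = 0; i < molecules_.size(); ++i) {
            if (molecules_[i].sensorId == sensorId) {
                reading = molecules_[i].reading;
                molecules_.erase(molecules_.begin() +
                                 static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    std::size_t size() const { return molecules_.size(); }

private:
    std::vector<SignallingMolecule> molecules_;
};
```

Since each molecule can be consumed only once, the number of copies released per reading directly limits how many cells can process the same signal.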

Signal complexity

The complexity of the signal – and the directly proportional number of signalling molecules – defines the number of cells that can differentiate to a particular sensor. If the signal is complex, we want more candidate solutions in the hope that this will increase the chances of finding the optimal solution – in other words, we want more cells to process the same signal. When the complexity parameter is decreased, new stem cells don’t differentiate to the respective sensor if a sufficient number of cells are already differentiated to it.

System health

When a sensor signal suddenly changes in a way that was not previously observed, functional cells start to die. By monitoring the death rate we get a characteristic that we call system health; low health is an indicator of either an anomaly or a faulty sensor.

This observation has at least two consequences:

  • Initially the system detects anomalies on each pattern – thus we have to specify the learning phase, when all alerts are discarded.
  • Depending on the character of the anomalies we want to detect, we need to configure how soon cells die without successful pattern recognitions (I call this parameter “cell energy decay rate” – read more in my post about functional cells).
    • What I mean by “the character of the anomalies we want to detect” is anomaly granularity or precision: with high precision we will detect noise, with low precision we will only detect long anomalies and miss the subtle ones.
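Health monitoring with a learning phase can be sketched as follows. The class name, the definition of health as the percentage of cells that survived the last observation window, and the fixed alert threshold are all assumptions made for illustration; the post only states that health is derived from the death rate and that alerts are discarded while the system is learning.

```cpp
#include <cassert>

// Hypothetical health monitor for one cell colony.
class HealthMonitor {
public:
    HealthMonitor(int learningWindows, int alertBelowPercent)
        : learningWindows_(learningWindows),
          alertBelow_(alertBelowPercent),
          windows_(0),
          health_(100) {
        assert(learningWindows >= 0);
        assert(alertBelowPercent >= 0 && alertBelowPercent <= 100);
    }

    // Records one observation window. Returns true if an alert should be
    // raised; alerts are discarded during the learning phase.
    bool update(int deaths, int cells) {
        assert(cells > 0 && deaths >= 0 && deaths <= cells);
        health_ = 100 - (100 * deaths) / cells;
        ++windows_;
        if (windows_ <= learningWindows_) return false;
        return health_ < alertBelow_;
    }

    int healthPercent() const { return health_; }

private:
    int learningWindows_;
    int alertBelow_;
    int windows_;
    int health_;
};
```

With a learning phase of two windows and a threshold of 50%, a window in which 8 of 10 cells die is ignored twice and raises an alert the third time; a later window with a single death brings health back to 90% and no alert.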

13. Model: organism level

Sensor organism

On the organism level the system consists of multiple organs (Arduinos), each belonging to one of two types:

  • Cell colony
    • Collects and pre-processes sensor readings
    • Learns patterns in sensor data
    • Detects anomalies
  • DNA server
    • Stores genetic sequences that encode learned patterns
    • Provides data to cell colonies on request
    • Accepts updated genes from the cell colonies

It is the cell colonies that perform the main function of the system, namely anomaly/fault detection. Each cell colony has a characteristic of health, which changes depending on how well the colony recognizes patterns in the sensor data.

The reason we need a DNA server is the following: a cell colony simply doesn’t have enough memory left to store all the lengthy genetic sequences in its mere 2.5 KB of RAM. Instead the system dedicates one or more Arduinos to just storing the DNA. On the bright side of this hardly elegant solution, it would be easy to expand this model if we decide to add some genetic code optimization mechanism – something running in the background and reducing DNA size.

See next posts for lower level details!

12. Model: overview

A couple of weeks have passed and now I finally have a model. I present to you… the Sensor Organism! In the following few posts I will outline the model on each of the four levels it is defined on, namely: organism, organ, cell and genetics.

Four levels of the model

The model describes an Artificial Developmental System (wiki) with all perks of the genre:

  • The system evolves, i.e. optimizes its behaviour;
  • Experience is captured in a genetic sequence;
  • Learning is based on Gene Regulatory Networks (see my post on GRNs);
  • System relies on multiple agents called cells;
  • Each cell constructs only a candidate solution: the system is tolerant to failure of individual cells;
  • Cells die, which is an indicator of an anomaly in the signal or a sensor failure;
  • On initialization cells differentiate into the variation that the system needs the most at the moment.

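Each level of the model is described in detail in the following posts:

  • “13. Model: organism level”
  • “14. Model: organ level”
  • “15. Model: cellular level”
  • “16. Model: GRN”
  • “17. Model: genetic encoding”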

Finally, a brief analysis of the model in the context of self-organisation is in the post “18. Self-organization”.