Self-Organised Neural Networks: A simple, natural and efficient way to intelligence

The piece of code bellow is indisputable, reproducible, proof of what we are going to say in this post.

It implements a basic spiking neural network, with a simple learning method that introduces a new paradigm.

The accuracy reaches state of the art (98.9%) on PI-MNIST with 750,000 low precision (binarisable) connections (>98.6% with 80,000 connections), one layer, no bias, no back-propagation and only additions. It is yet to be optimised. It works online, and is backed by existing mathematical theorems. It also relates to the real cortical structure and personal experience; and even popular wisdom.

So it cannot be just ignored or dismissed. It requires proper falsification, or further analysis and extension.

I will describe the system and its mechanisms, analyse it, and share the vision, starting from the beginning, in plain language, so everybody can understand, and, yet, it will be essentially complete.

I am setting up a website where I will add more information and code depending on expressed interest, if any. The code at the bottom is there for completeness, immediate reference and, required, formal proof.

The system

A Neuron, in the brain, is connected to multiple other neurons from dendrites. It receives signals carried by nerves. Each connection has a strength (weight) that varies over time. There is positive and negative connections. When a neuron receives a signal from a connection, it ‘adds’ or ‘subtracts’ (chemically) the weight associated with the connection, inside its body, to a ‘total’. If it reaches the limit of what it can hold (its threshold), it spikes: it sends a signal through its axon, its only outgoing nerve, for other neurons to receive. In doing so empties its content. And so on… In our system, there is neurons that operate the same way.

When a neuron spikes, the connections that have contributed to it are reinforced. It is Hebb’s rule. In this system, we do the same.

Connection are not permanent. They are created on the fly, and their weight changes over time. Decay reduces weights over time, as well. When the weight goes down too much, the connection is cut. In this system, we do the same.

In the brain, neurons group together, locally, in what is called columns. They tend to spike, or not, in a fairly synchronised manner inside a group. It is a matter of scientific debate how or what for, but it is. In this system, we set up neurons in groups, as columns.

We have, at this point, described most of the system. We will now use it in a real test case and describe how it operates.

The mechanism

In the field there is a dataset that is used in most research papers to compare the performance of the systems and methods presented: the MNIST dataset. It is made of the digits, 0 to 9, handwritten by different people and presented in a pixel matrix of size 28x28 in shades of grey. There is 60,000 samples to learn from. The system must use those to ‘learn’ to identify digits. There is an additional 10,000 samples to test the system with, after it has been trained.

Intelligence is not just ‘one thing’. There is many clues in a handwritten number. Most of them relate to geometry: the shape of the line, its orientation,… To make it more difficult, in the Permutation Invariant (PI) version, the pixels of the display matrix can be mixed up randomly. That is the version used in all research papers on the core subject of’AI’ (no augmentation, of course). The clues that are left are only the correlations of pixels ‘lightening up’ (it is, actually, darkening), wherever they are.

The first task of a brain is to detect correlations, before it can start to wonder about causality.

So we set up ten groups of neurons, one for each digit. Each neuron is connected to a few pixels from the input matrix.

When we present a sample to the system, each neuron will look at the pixels its inputs are connected to. If a pixel is on, the weight associated with the input is added to its total. If that total reaches a given, fix, threshold, the neuron ‘spikes’ and the total is reset to zero. The spikes are counted per group. We compare the spike count of the groups. The digit group with the highest count is the answer of the system.

Learning is adapting the weights.

When the system learns from a sample, it adjusts the weights of the connections that are active. In neurons that belong to the group that represents the actual digit, the connection weights will be augmented (the weights go up); positive learning. In the neurons belonging to the other groups, they will be reduced; negative learning. All weight adjustments are of the same, fix, amount.

The connections decay over time. All weights are reduced by a small, fix, amount at regular intervals. This is done on the absolute value of the weight (so negative values increase).

The system starts with no connection. Before we learn from a sample, a small number a connections are created from active pixels in the sample to random neurons in each digit group that have available slots (Hebb’s rule). They are given a fix initial weight; positive for neurons in the sample digit group, negative otherwise. The connection weight will vary over time as learning goes. If it goes under a set limit (in absolute), the connection is cut off.

The data in the MNIST is made of pixels in shades of grey. To input them into the system, we reduce the depth to 4 shades, linearly (1,64,128,192), and present 4 views, in as many ‘cycles’, to the system. Increased precision is useless, and we want binary inputs.

That is it for the mechanism.

As everybody knows, that has studied the field, the basic perceptron—this is a most simplistic version of it—is intellectually interesting, but it is not usable as is.

So this should give a very bad result on the task, and it does.

In 1986, Rumelhart, Hinton, Williams, proposed the use of hidden neurons and back-propagation (BP/​SGD), on the perceptron, as a way to solve that problem. And it works, with additional details, providing the base for most of the recent successes in AI.

But it requires a very large amount of silicon and power, so we had to wait thirty years for that, and we will have to wait much longer again before we see it used at scale on a personal device. More annoyingly, the method involves precise calculation of the error between the result the system gives and the result it should give, to be applied on each neuron in a precise manner, and backwards (nerves are directional).

Clearly, the neurons in the brain do not do that. So it is, indeed, completely artificial. It means that the brain does it another way, that nature has found, and we have not, so far.

Another way

Born in 1964, I have been searching for it fifty years. I followed advances in the field as a kid, got my first real computer at 15, studied the field in the mid 80s, until 1986. Like everybody else at the time, I calculated how long it would take for BP/​SGD to be usable, and decided to just run my life doing other things. But before I stopped, I listed the constraints that you would have to accept to make the wait shorter (only additions of integers; sparse connectivity; it should be possible to implement it on asynchronous analog hardware, when memristors become a reality), and kept it in mind to ponder in my spare time, looking for that other way.

The sparse dynamic connection part of this system is the leftover of that work. Initially, it was much more complicated, but the simpler version used here is enough, at this scale. On its own, it proved not to be enough to reach the accuracy achieved with BP/​SGD.

Starting in 2016, I searched frantically for month, tried hundreds of things. One day, I gave up trying to control everything. And it started working.

I was trying methods used in chess: take the system, at some point in the process, sort out the samples based on their error and try to find a way to decide witch sample to learn from next. I could not find one. I decided to try simple: let the sample with the biggest error be the one to learn from. It worked.

After a very large amount of work refining it, this is the main result:

The question to ask is not ‘how’ to learn, but ‘when’.

When presented with a digit sample, the system checks if it is worth learning from or not. The group that represents the digit (IS group) should have a high response and the others (ISNOT) a low one. High or low are not set numbers, they are relative to all responses, and, for one sample, between groups. If you take two similar samples, one written with a thick line and one with a slim one, the spikes will mechanically be higher with the first one, as there is more input. The ‘8’ have more pixels active then the ‘1’. So high and low must be expressed as the difference between the groups, for each sample.

For the IS group, what you want is for it to have a higher spike count then the highest of the ISNOT ones. The other ones do not matter. So you take that difference (IS-difference). For the ISNOT groups, you just want them to be low, so you use, for each group, the difference between that group and the average of all the other ones (so 9 ISNOT-differences).

For each sample, when a sample is presented, you take (and store) those values (IS and ISNOTs).

- For the IS group, if the IS-differences is in the top 5% of all those seen so far, you do a positive learning.

- For each other group, if the ISNOT-difference is in the top 0.66% of all those seen so far, you do a negative learning.

Many variations are possible. This one works best on MNIST (only by a small margin).


That is it. And it is enough to reach state of the art. As simple as possible, but not simpler.

One useful addition to the update mechanism is to only update the weights of a neuron if its total when it spikes is below a limit. The idea is that if a neuron spikes too quickly, there is no point in increasing it: it is already effective. It has the added advantage of a large reduction in the number of updates, and a better stability.

There is many ways to implement that rule. In this code, we set up two thresholds, used at learn time. The low one is the spike one: if it is reached, the total is reset. Then, if the total is under the high one, the update takes place, otherwise it does not. It is another form of selective learning, within each neuron.

The values for the adjustments, decay, cutoff weight, etc…, do not change much to the final accuracy. They mostly affect the convergence rate. The variance in accuracy, within a run once it has converged, and between runs, is within 0.1% on the tests set, and it always converges.

Quantilisers

The last mechanism we will detail here, because it is of interest, is the one we use to compute the values at a given quantile of the distributions of those differences. You can store and sort them, of course (that is what I did first), but it lacks elegance and you need to store all that data. There is, again, a much easier and better way, that everybody can actually come up with.

Stand in the street with a vertical measuring stick. Place the mark at the bottom. When somebody passes by, if they are taller than the mark, up the mark one notch. If they are shorter, down the mark one notch. As people pass by, it will go up and down.

Think about it: where will it converge to?

That is correct (most people figure it out): It will be the average height of the passers by. Actually, it is the median height. But for heights, a gaussian distribution, it is the same.

1 up, 1 down, gives the middle: 100 /​ (1+1) = 50%.

What would happen if you were to up by three instead of one when somebody taller passes by?

3 up, 1 down, gives the value at the first quarter of the distribution: 100 /​ (1+3) = 25 %

9 up, 1 down will give the height over witch they are in the 10% tallest: 100 /​ (1+9) = 10%.

For the bottom part of the distribution, do 1 up and X down.

So, if you want to know if a sample difference is within the highest 5%, for each sample seen, up by 19 when it is over the current estimate (and you learn) and down by 1 if it is under (and do not learn).

This is beautiful for seven reasons:

  • It is simple and most people can figure it out.

  • It only uses addition/​subtraction.

  • It only uses one stored value, so it makes the whole process readily usable.

  • It works ‘online’. That is, you can go on forever with it. If a group kids passes by, then a group of Dutch tourists, it will adjust to both as it goes.

  • There is a mathematical evidence that this is, indeed, valid

  • There is a strong similarity between this process and that of the neurons in our system. Each ‘notch’ is a, binary, connection and the threshold changes instead of the connection weights. And it always learns; no ‘meta-quantiliser’...

  • Before waiting 30 years, I needed to convince myself that it was not impossible to ‘evaluate’ thing with a simple process. This thought experiment did that, so I did wait and kept thinking.


And that brings us to the final nail.

I was looking for a name for this last process, and I came up with the obvious ’quantiliser’. I typed it in Google, out of curiosity. Do it. At the top (it is an unusual word) you will find links to this paper, by Jessica Taylor, who I bow to.

It is a perfect vindication of using ‘quantilisers’ to decide when to learn. It is also aimed at improving the safety of AI systems. I take it.

The conditions under witch it is valid are met, in our system, because each update is of a small fix value.

The paper is about finding a better method than expected utility maximisation, in a decision process, for a good and safe answer. Maximising is precisely where is started when I tried learning from the sample with the worst error.

The proposed answer is to select any decision that “is in the top proportion of some ‘base distribution’ over actions, sorted by the expected utility achieved if that action is executed”. She even had the common decency to provide a rational analysis of the principle behind it. More right, as well.

Run our system without the quantiliser (learn from all samples) for a while, then stop. From that same starting point each time, apply learning from each sample (the ‘action’) and measure the impact it makes to the global accuracy. That is the ‘expected utility’. Sort those impacts with, next to them, the IS-difference for the sample. You will immediately see that there is a very good correlation.

So the ‘top proportion of the base distribution over actions’ can be those samples that have the highest IS-difference. It is interesting to note that correlation diminishes, gradually, as the system learns and accuracy improves.

It is remarkable that, in the case of ‘learning’, a suitable utility function should be ‘when you are wrong’.

Is it?

Meaning gets lost in symbols and functions, as Descartes noted in his ‘discourse on method’, and common-sense must be ruled, not dismissed and ignored.

Analysis

This system works, state of the art, and is as simple as possible. How come it works ?

If you do not apply the selection rule, it performs badly. As you increase the selectivity, the accuracy increases with it (with an optimum). All other settings have a marginal impact on the result.

Unambiguously, this is the one parameter that matters; what makes it work.

One way to look at it is that the learning system only sees a fraction of the dataset at any time. The samples in that subset change over time. A sample that was out can come in and another go out, following a learning step.

So it can be seen as two interacting, changing, sets: the weights and the samples.

And it goes on forever, with no outside intervention, adapting to changes in the learnable subset, if there is some.

That is a self-organised system.

So we will call it Self-Organised Neural Network. SONN.

The probability for a digit sample to be learned from is related to its difference to the norm (for that digit). Samples that are unusual will be learned from more often.

Instead of calculating a proportion of the difference to back-propagate and adjust the weights, as in BP/​SGD (the generalised delta rule), here, we adjust the weights by a fix value, a number of times that is proportional to the difference; to the same effect, but without the complications.

Each neuron in a group can be seen as a model, and the group does pure model averaging. It is also the case with the cycles. That is why we use that type of input mapping on this task (a form of frequency coding). As usual, the best mapping will be task dependent and a subject of further study (data has to be binarised). An interesting fact is that the total carried between cycles can be halfed, for a slightly better result (an acceptable exception as it is just a right shift, implementable in analog).

The system is also resilient. Random ablation of 20% of the pixels goes unnoticed. Same with additional noise.

To those that are interested in recent developments around sparse connectivity, note that this system is really sparse by structure and, on top of it, a large part of the connections is also ignored during training.

You can split neurons into subgroups trained with separate quantilisers and group the subgroups by averaging them, or any other ways of splitting/​regrouping, to no difference.

The neurons provide Internal representations due to their limited connectivity, and they replace hidden layers. The group/​column is not a ‘layer’, but a convenience to interface with the quantilisers. In a larger system, the separate neurons provide, in population coding, a convenient way to dispatch information to the rest of the system (must see Numenta).

A recurrent setup works, to similar accuracy, by using the output of all neurons as negative inputs, instead of the input matrix.

To stop the weights from rising indefinitely, they are limited to the threshold value. It is not required. If you set the limit to a multiple of it, it works, with a limited loss on accuracy. Whatever value you set as a limit, weights will accumulate at that limit. Decay is a way to limit that, but if you let it go, it only affect the accuracy by a small amount. That brings the question of binarisation: if they all are at the limit, it is, in effect, binarised. Indeed, If you treat all weights, at test time, as 1 or −1, but keep full precision at learn time, the system is mostly undisturbed.

What matters is the difference between the number of positive and negative activations, not the weight values.

Empirical evidence indicates that there is an optimal ratio between the two quantile values (IS and ISNOT). The IS quantile (5%) is always valid. We set the ISNOT quantile to be such that the ratio of the number of negative learnings to positive learning is around 1.18. The reason for that is yet to confirm, but it is my educated guess that it is related to the ratio of the surface (number of pixels activated) covered by the ‘sum’ of a given digit against that covered by the ‘sum’ of all the other digits. There is, then, more negative connections but with smaller weights. That measure of ’surface’ can be done many different ways, and finding one that ‘fits’ would not prove anything. It is specific to this dataset. Another way to balance this difference is by using different weights for positive and negative updates, with similar effect.

A large part of the time in this research was spent looking at every detail of the learning mechanism, trying to imagine all the ways they could be changed. Every line of code (is there additional conditions?), every variable (should it be within some limits?) is the root of a bunch of possibilities.

I tried thousands. And they combine. They have to be tried more than once, and at different values, and network size.

I have mentioned a few points, saying that, in the end, they do not really matter. The only ones that have some impact are the parameter for the ISNOT quantilisers and/​or the weight update difference, and decay. And even those have only a small impact on final accuracy; they impact convergence speed. The resulting distribution of connections and weights can vary enormously, but the accuracy is the same. You can change parameters during a run, prune connections, change weights arbitrarily as it goes: the system always recovers very quickly.

That, in our view, is the most interesting aspect of it, and demonstrate the power of self-organisation.

The real shift, as I see it, with this paradigm, is from functions to distributions. Distributions are what our brain captures and builds on. We do not find a suitable function for each existing distribution. Gaussian are widespread and convenient, but they are not it all. Felines size distribution has two bumps; the moment ‘intelligent agents’ come into play, Gaussian’s are replaced by Pareto’s. There is something else about maxout… Quantilisers handle any type indifferently. High precision is futile.

There no such things as functions anywhere in this and I know it will aggrieve a lot of people, but feel free to try add some anywhere to optimise something…

Reality is that what is needed as a base for the future is a basic building bloc that is small, efficient and resilient. I have wasted very little time testing large networks, just enough to get to that 98.9%. Remember this is not supposed to work.

Nature can be somewhat described using mathematics, but it does not operate using them. Not even multiplication.

As you see, there is a lot to look at and investigate. This is only just the beginning for this new approach.

Intelligence

A last thought experiment, for everybody:

We are continuously exposed to situations, and make decisions. There is something on the path, should I go around it? My stomach tells me I am hungry: what am I going to eat? Will the hero win in the end? (And down to: should I move my arm up a bit as I turn to seat in that armchair?).

Each time, we use a list of conditions to make the decision. Is there a car coming? Is the object small, dirty? Am I at home? What’s in the fridge?… We decide based on a sort of weighted sum of those conditions. It is never twice the exact same ones, and the weights are approximate.

After we have made the decision, we get a result. That car was really slow, I had to wait too long; walking on that piece of paper did not matter. I should not have had a burger, I feel bloated. I ‘forgot’ I had already got one for lunch. That chicken was nice. In that scribbled phone number, the third digit was a 7, not a 1. They were fast writing it. And I knew that.

We up and down, a bit and fuzzily, the causes that we remember and identify to have contributed to the decision, depending on wether they were for or against that good or bad decision. We do that thousands of times a day and we are mostly right. As a baby we were mostly wrong. Sometimes we make a bad decision and, with little awareness, we adjust our views. That is learning.

In logic, there is three modalities: deduction, induction, and abduction. That last one is mostly useless from a formal point of view -it is fuzzy- but it is the one that is most used in daily life, and is the basis of creativity. The way we ‘feel’ the world around us is abducted from our daily life.

That is the first stage of intelligence, the one this system aims at, and the one we can all relate to. Even the dog.

I should eat less junk food. I should not be too careful. I should check with the person that gives me their number. That is generalising, and it is the next step.

Setting up rules to proceed to generalisation is method, and the (last?) step after.

Roadmap

We mentioned the existence of neuronal columns in the brain and made an analogy with our neuronal groups.

There is about 12 billion neurons in the neocortex. A column is a few hundred neurons; say 1,000. That makes for 12 Million columns.

There is 50,000 words in a dictionary. They combine, in different ways, into ‘concept’ or ‘class’, ‘type’...

12 Millions /​ 50,000 = 240 combinations per word. It is reasonable.


Columns will assemble in zones, and than regions, forming a map, driven by inputs provided by sensors linked to different areas on its edge. The design part is in the map settings. Look at the brain map across evolution. It will learn from real life: that is our dataset.

Part of the map (the neocortex) is detached from direct inputs, and free to structure as the self-organising process will make it to be, based on what has been received from the edge. That is were global structuring will happen.

It will, itself, receive information from… itself, and structure it. What would you call that ?

‘I’ is abducted self.

Once it has that, it can move from correlation detection to causality analysis. Call it reason.

And we must not forget the hormonal system (feelings, emotions,...) that is separate, but interact with the nervous system, inwards by nerves, outwards by hormones, through the blood stream, to neurotransmitters.

As model averaging shows, it is better to have more, different, less correct, opinions, than one that knows it all. On the PI-MNIST the system can reach 98.5% with only 200 neurons per group and 9 connections each. Increasing the size leads to very quickly diminishing returns (but better stability across runs), and is useless for practical matters.

Existing commercial hardware is usable for our system, but wholly inefficient. With a simple hybrid design (it is only routing and additions, absolute synchronicity is not required, small loss acceptable), this system, at that scale, could fit today (we ‘think’ at 40Hz, CPUs at 4Ghz), in a large rack. In a skull in 20 years.

Movement is quite simple if you think about it, and will benefit from nitinol muscle. Vision can be done once and for all. Unlike movement that can be altered, it never changes. A wise set of filters will save a huge amount of hardware.

Data path from the eye to the brain is narrow and the sampling done by neurons in the retina is similar to what is done by our neurons. Those are to be further studied.

Time is relative. Chronology is not. ‘Brain waves’ serve a purpose for approximate timing and synchronisation.

Maybe it is time to look again at Copycat.

Unsupervised learning is an obvious next step to be studied.

The genes do not contain a complete description of the cortical map, just rules for its development that work most of the time. Finding global rules and architectures that foster the development of the map is a core item on the todo list.

Conclusion

It does sound ridiculous that it could be so simple (disappointing ?), and it took me a long time and thousands of tests to convince myself. this is not something I ‘thought of’. I just found it on my way (like Columbus found America). My only merit is to have looked where nobody would waste their time looking, spent thousand of hours on it on my own time (and that of my family), and not dismiss it as ridiculous when I found it, but worked more to make sense out of it.

There is much more to show and say, and study, and I hope some will get interested, look at it, tinker with it, think and wonder about it, and build on it. I hope distributed wisdom will self-organise around the evidence.

The use society makes of scientific discoveries is not the responsibility of scientists, but it is their duty to state their domain of validity. Mathematicians should clearly state the limits of mathematical models and AI scientist the limits of AI models, and all should stand against their abuse.

AI models include a larger number of input parameters, but it is still humans that select them and humans that set the criteria for their validation; and there will never be enough of them (that butterfly…) specially when it comes to human matters. They belong to the technology domain. Good servants but bad masters.

784 Dimensions to tell digits apart. How many for humans?

Hubris has seized yet another civilisation, this time under the guise of mathematics and its AI variant.

It is, historically, a sign of imminent collapse.

My attention is required elsewhere at this time, and then I will leave. I am a polymath that does not do maths, so I do not belong.


This is my message in a bottle. Let it be.


// you will find a civilised, full, version on the website. This is as compact as bearable
//  no, it is not perfectly clean
// setup the CPU_THREAD depending on your CPU. 4 is run of the mill and default
// to compile without GPU: gcc sonn.c -o sonn -O2 -pthread 
// if have a GPU: uncomment #define GPU and see your CUDA documentation.
//
//  throughout, variable ‘a’ is a group, ’n’ a neuron in a group, ‘s’ a connection in a neuron
//#define GPU 1
#define CPU_THREAD 4    
// warning without GPU: 180 hours with 4 threads to 98.9% , 18h to 98.65%

#include    <stdio.h>
#include    <stdlib.h>
#include    <fcntl.h>
#include    <unistd.h>
#include     <pthread.h>

// dataset
#define NTYPE 10
#define INSIZE 784
// size. NSYN: For 98.9%: 7920. For 98.65% : 792
#define NSYN 10
#define NSIZE 792    // multiple of CPU_THREAD
// params
#define THRESHOLD 1000000
#define THRESHOLD_MIN 300000
#define THRESHOLD_MAX 1200000
#define MIN_WEIGHT 97000

#define NCYC 4
unsigned char wd[4] = { 192, 128, 64, 1 } ;

int list[3][60000],listSize[3]={60000,10000,1000},listSet[3]={0,1,0},setsize[2] ;
unsigned char type[2][60000] , in[2][60000][INSIZE] ;
int notnuls[60000], nbc[NTYPE], nbcn[NTYPE][NSIZE], zero[60000*NTYPE], err[60000];
int nsp[3][60000][NTYPE] ;    // number of spikes per group and sample
int **weight[NTYPE],*tmpw ;
short int **from[NTYPE],*tmpf ;

// params
float quantP,quantUpP,quantN,quantUpN,quantDivP,quantDivN ;
int adjp,adjn,lossw,losswNb,nbNew,divNew,briefStep ;
// globals
int nbu,nblearn,nbtested,listNb ;
int toreload[NTYPE] ;

// threading
struct thrd_arg {
    int from ;
    int to ;
    int trhdNo ;
} ;

struct timespec mtwait ;
int mtact_learn[CPU_THREAD] , mtact_test[CPU_THREAD] ;
struct thrd_arg thrd_args[CPU_THREAD],thrd_test_args[CPU_THREAD] ;
int thrd_test_first[CPU_THREAD],thrd_test_last[CPU_THREAD] ;
int mtadj, mtsample, mta ;

int rnd( int max ) { return rand() % max ; }

void init()
{
    int a,n,set,l ;

    for( a=0 ; a<NTYPE ; a++ ) {
        weight[a] = (int **)calloc(NSIZE , sizeof(int *)) ;
        from[a] = (short int **)calloc(NSIZE , sizeof(short int *)) ;
        for( n=0 ; n<NSIZE ; n++ ) {
            weight[a][n] = (int *)calloc(NSYN , sizeof(int)) ;
            from[a][n] = (short int *)calloc(NSYN , sizeof(short int)) ;
        }
    }
    tmpw = (int *)calloc( NTYPE*NSIZE*NSYN , sizeof(int) ) ;
    tmpf = (short int *)calloc( NTYPE*NSIZE*NSYN , sizeof(short int) ) ;
    for( set=0 ;set<2 ;set++ ) for( l=0 ; l<setsize[set] ; l++ ) list[set][l] = l;
}

int compinc( const void *a , const void *b )
{
    if( *((int *)a) > *((int *)b) )    return 1 ;
    else                            return -1 ;
}

void quant( int sample )
{
    int is,isnot,tot,a ;
    float d ;
    
    is = type[0][sample] ;
    
    if( (float)( err[sample] ) != quantP ) {
        if( (float)( err[sample] ) >= quantP )     quantP += quantUpP / quantDivP ;
        else                                    quantP -= 1.0 / quantDivP ;
    }

    tot = 0  ; for( a=0 ; a<NTYPE ; a++ ) tot += nsp[2][sample][ a ] ;
    
    for( isnot=0 ; isnot<NTYPE ; isnot++ ) {
        if( isnot != is ) {
            d = (float)(nsp[2][sample][ isnot ]-(tot -nsp[2][sample][ isnot ])/9);
            if( d != quantN ) {
                if( d >= quantN )    quantN += quantUpN / quantDivN ;
                else                quantN -= 1.0 / quantDivN ;
            }
        }
    }
}

#ifdef GPU
cudaStream_t streams[60000] ;
short int *gpu_from ;
int *gpu_weight, *gpu_nsp, *gpu_list[3], *gpu_nspg, *gpu_togrp ;
unsigned char *gpu_wd , *gpu_in[2] ;

__global__ void Kernel_Test( int nsize, int nsyn, int ncyc, short int *from, 
            int *weight, unsigned char *in, unsigned char *wd, int *nsp, 
            int listNb, int *list, int sh, int nsh )
{
    int s,n,index,ana,tot,ns,cycle,off7sp,sp ;

    n = blockIdx.x*blockDim.x +threadIdx.x ;
    ns = n*nsyn ; ana = n / nsize ; n = n -nsize*ana ;
    if( n<nsize && ana<NTYPE ) {
        for( index=sh ; index<sh+nsh ; index++ ) {
            off7sp = list[index]*INSIZE ; tot = 0 ; sp = 0 ;
            for( cycle=0 ; cycle<ncyc ; cycle++ ) {
              for( s=0 ; s<nsyn ; s++ )
                tot += (in[off7sp +from[ ns +s ]] >= wd[cycle]) *weight[ ns +s ] ;
                
              if( tot > THRESHOLD ) { sp++ ; tot = 0 ; }
              else tot >>=1 ;
            }
            if( sp ) atomicAdd( nsp + index*NTYPE +ana , sp ) ;
        }
    }
}

void init_gpu()
{
    int set,i ;
    
    cudaSetDevice(0) ;

    cudaMalloc((void **)&gpu_wd , NCYC *sizeof(unsigned char) ) ;
    for( set=0 ; set<2 ; set++ ) 
     cudaMalloc((void **)&gpu_in[set], setsize[set]*INSIZE*sizeof(unsigned char));
    for( set=0 ; set<3 ; set++ ) 
        cudaMalloc((void **)&gpu_list[set], listSize[set] *sizeof(int) ) ;
    cudaMalloc((void **)&gpu_from, NTYPE*NSIZE*NSYN *sizeof(short int) ) ;
    cudaMalloc((void **)&gpu_weight, NTYPE*NSIZE*NSYN *sizeof(int) ) ;
    cudaMalloc((void **)&gpu_nsp, setsize[0]*NTYPE *sizeof(int) ) ;
    cudaDeviceSynchronize() ;

    cudaMemcpy( gpu_wd , wd , NCYC *sizeof(unsigned char),cudaMemcpyHostToDevice);
    for( set=0 ; set<2 ; set++ )
    {
        for( i=0 ; i<setsize[set] ; i++ ) 
            cudaMemcpy( gpu_in[set] + i*INSIZE , in[set][i] , INSIZE 
                            *sizeof(unsigned char) , cudaMemcpyHostToDevice ) ;
        cudaMemcpy( gpu_list[set] , list[set] , setsize[set] 
                            *sizeof(int) , cudaMemcpyHostToDevice ) ;
    }
    for( i=0 ; i<6000 ; i++ ) cudaStreamCreate(&(streams[i])) ;
    cudaDeviceSynchronize() ;
}

void test_gpu()
{
    int i,a,n,b,is,isnot,ok,sample,g,bssize,startat,set,l ;
    static int tmpn[NTYPE*60000] ; 

    set = listSet[ listNb ] ;
    for( a=0 ; a<NTYPE ; a++ ) {    // upload from & weight
        if( toreload[a] == 1 ) {
            for( n=0 ; n<NSIZE ; n++ ) {
                bcopy( from[a][n] , (void *)(tmpf +(a*NSIZE+n)*NSYN) ,  NSYN
                    *sizeof(short int) ) ;
                bcopy( weight[a][n] , (void *)(tmpw +(a*NSIZE+n)*NSYN) ,  NSYN
                    *sizeof(int) ) ;
            }
            cudaMemcpy( gpu_from +a*NSIZE*NSYN , tmpf +a*NSIZE*NSYN , NSIZE*NSYN
                *sizeof(short int) , cudaMemcpyHostToDevice ) ;
            cudaMemcpy( gpu_weight +a*NSIZE*NSYN , tmpw +a*NSIZE*NSYN , NSIZE*NSYN
                *sizeof(int) , cudaMemcpyHostToDevice ) ;
            toreload[a] = 0 ;
        }
    }
        
    if( listNb == 2 ) cudaMemcpy( gpu_list[listNb], list[listNb], listSize[listNb] 
        *sizeof(int) , cudaMemcpyHostToDevice ) ;    // upload list
    cudaMemcpy( gpu_nsp , zero , NTYPE*setsize[0] 
        *sizeof(int) , cudaMemcpyHostToDevice ) ;
    cudaDeviceSynchronize() ;

    // run
    if( listNb == 2 )    bssize = listSize[listNb] ;
    else                bssize = 100 ;
    
    for( startat=0 ; startat<listSize[listNb] ; startat+=bssize )
        Kernel_Test<<<( (NTYPE*NSIZE+ 31) ) /32, 32 ,0,streams[startat]>>>
        ( NSIZE, NSYN, NCYC, gpu_from, gpu_weight, gpu_in[set], gpu_wd, gpu_nsp, 
          listNb ,gpu_list[listNb] ,startat ,bssize ) ;
    cudaDeviceSynchronize() ;

    // download nsp
    cudaMemcpy( tmpn , gpu_nsp , listSize[listNb] *NTYPE
        *sizeof(int) , cudaMemcpyDeviceToHost ) ;
    cudaDeviceSynchronize() ;

    // dispatch
    for( l=0 ; l<listSize[listNb] ; l++ ) {
        if( listNb == 2 )
            bcopy( tmpn +l*NTYPE, nsp[2][ list[listNb][l] ], NTYPE*sizeof(int));
        else
            bcopy( tmpn +l*NTYPE, nsp[set][ l ], NTYPE*sizeof(int) ) ;
    }
}
#endif

void readset( int set , char *name_lbl , char *name_data )
{
    int nn,n,fdl,fdi ;
    
    setsize[set] = listSize[set] ;
    fdl = open( name_lbl , O_RDONLY ) ; lseek( fdl , 8 , SEEK_SET ) ;
    fdi = open( name_data , O_RDONLY ) ; lseek( fdi , 16 , SEEK_SET ) ;
    
    for( n=0 ; n<listSize[set] ; n++ ) {
        read( fdl , type[set] +n , 1 ) ;
        read( fdi , in[set][n] , INSIZE ) ;
        if( set == 0 )  {    
            notnuls[n] = 0 ; 
            for( nn=0 ; nn<INSIZE ; nn++ ) 
                if( in[set][n][nn] ) notnuls[n]++ ; 
        }
    }
    close(fdl) ; close(fdi) ;
}

void *test_mt_one( void *args )
{
    struct thrd_arg *arg ;
    int trhdNo,a,n,t,c,s,r,l,set,sample ;

    arg = (struct thrd_arg *)args ;
    trhdNo = arg->trhdNo ;

    while( 1==1 ) {
        while( ! mtact_test[arg->trhdNo] ) nanosleep( &mtwait , NULL ) ;

        set = listSet[ listNb ] ;
        for( l=thrd_test_first[trhdNo] ; l<thrd_test_last[trhdNo] ; l++ ) {
            sample = list[listNb][l] ;
            for( a=0 ; a<NTYPE ; a++ ) {
                nsp[listNb][sample][a] = 0 ;
                for( n=0 ; n<NSIZE ; n++ ) {
                    t = 0 ;
                    for( c=0 ; c<NCYC ; c++ ) {
                        for( s=0 ; s<NSYN ; s++ )
                            if( in[set][sample][ from[a][n][s] ] >= wd[c] )
                                t += weight[a][n][s] ;
                        
                        if( t > THRESHOLD ) { nsp[listNb][sample][a]++ ; t = 0 ; }
                        else t >>=1 ;
                    }
                }
            }
        }
        
        mtact_test[trhdNo] = 0 ;
    }
}

int test_mt()
{
    int trhdNo,done ;
    int set,ok,l,is,b,isnot,sample,a ;
    
    l = 0 ;
    a = listSize[listNb] / CPU_THREAD ; b = listSize[listNb] - a*CPU_THREAD ;
    for( trhdNo=0 ; trhdNo<CPU_THREAD ; trhdNo++ ) {
        thrd_test_first[trhdNo] = l ;
        thrd_test_last[trhdNo] = l+a ;
        if( trhdNo < b) thrd_test_last[trhdNo]++ ;
        l = thrd_test_last[trhdNo] ;
        mtact_test[trhdNo] = 1 ;
    }
    
    done = 0 ;
    while( done != CPU_THREAD ) {
        nanosleep( &mtwait , NULL ) ;
        done = 0 ; 
        for( trhdNo=0 ; trhdNo<CPU_THREAD ; trhdNo++ ) 
            done += 1-mtact_test[trhdNo] ;
    }
}

int test() 
{
    int set,ok,l,is,b,isnot,sample,a ;

#ifdef GPU
    test_gpu() ;
#else
    test_mt() ;
#endif
    set = listSet[ listNb ] ; ok = 0 ;
    for( l=0 ; l<listSize[listNb] ; l++ ) {
        sample = list[listNb][l] ; is = type[set][sample] ; b = -1 ;
        for( a=0 ; a<NTYPE ; a++ ) {
            if( a != is && nsp[listNb][sample][a] >b  )
                { b = nsp[listNb][sample][a] ; isnot = a ; }
        }
        if( nsp[listNb][sample][ is ] > b ) ok++ ;
        if( listNb == 2 ) {
            err[sample] = b - nsp[2][sample][ is ] ;
            quant( sample ) ;
        }
    }
    return ok ;
}

void test_mt_init()
{
    int trhdNo ;
    pthread_t thrds[CPU_THREAD] ;
    
    for( trhdNo=0 ; trhdNo<CPU_THREAD ; trhdNo++ ) {
        thrd_test_args[ trhdNo ].trhdNo = trhdNo ;
        pthread_create( &thrds[trhdNo],NULL,test_mt_one,
            (void *)&thrd_test_args[ trhdNo ] ) ;
    }
}

void loss( int b )
{
    int a,n,s,w ;
        
    for( a=0 ; a<NTYPE ; a++ ) {
        for( n=0 ; n<NSIZE ; n++ ) {
            for( s=0 ; s<NSYN ; s++ ) {
                w = weight[a][n][s] ;
                if( w ) {
                    if( lossw && b%losswNb == 0 ) 
                        { if( w >0 ) w -= lossw ; else w += lossw ; }
                    if( abs(w) < MIN_WEIGHT ) { w = 0 ; nbc[a]-- ;nbcn[a][n]-- ; }
                    weight[a][n][s] = w ;
                }
            }
        }
        toreload[a] = 1 ;
    }
}

void connect( int nb , int sample , int a , int init , int g )
{
    int p,p0,f,f0,n,s,i,ii,pn ;
    //short int f0 ;
    static int ps[1000] ;
    
    if( nbc[a] == NSIZE*NSYN )    return ;
    if( NSIZE*NSYN - nbc[a] < nb )    nb = NSIZE*NSYN - nbc[a] ;
        
    for( i=0 ; i<nb ; i++ )  {
        ii = -1 ;
        while( i != ii ) {
            ps[i] = rnd( NSIZE*NSYN - nbc[a] ) +1 ;
            for( ii=0 ; ii<i ; ii++ ) if( ps[ii] == ps[i] ) break ;
        }
    }
    qsort( ps , nb , sizeof(int) , compinc ) ;
    
    n = 0 ; s = 0 ; p0 = 0 ;
    
    for( i=0 ; i<nb ; i++ ) {
        p = ps[i] ;
        for( ; n<NSIZE && p0!=p ; n+=(p0!=p) ) {
            if( p0 + NSYN - nbcn[a][n] < p )
                p0 += NSYN - nbcn[a][n] ;
            else
                for( s *= (pn==n) ; s<NSYN && p0!=p ; s+=(p0!=p) )
                    if( ! weight[a][n][s] )
                        p0++ ;
        }

        f = rnd( notnuls[sample] ) +1 ;
        for( f0=0 ; f0<INSIZE && f ; f0+=(f!=0) )
            if( in[0][sample][f0] )
                f-- ;
        
        from[a][n][s] = f0 ; weight[a][n][s] = init ;
        nbc[a]++ ; nbcn[a][n]++ ;
        pn = n ;
    }
}


void *learn_mt_one( void *args )
{
    struct thrd_arg *arg ;
    int to,start,trhdNo ;
    int a,n,c,s,sample,adj ;
    int tot,cnta,mask ;

    arg = (struct thrd_arg *)args ;
    while( 1==1 )
    {
        while( ! mtact_learn[arg->trhdNo] ) nanosleep( &mtwait , NULL ) ;

        for( n=arg->from ; n<arg->to ; n++ ) {
            tot = 0 ; cnta = 0 ;
            for( c=0 ; c<NCYC ; c++ ) {
                mask = 1 ;
                for( s=0 ; s<NSYN ; s++ ) {
                    if( weight[mta][n][s] 
                      && in[0][mtsample][ from[mta][n][s] ] >= wd[c] ) {
                        tot += weight[mta][n][s] ;
                        cnta |= mask ;
                    }
                    mask <<=1 ;
                }
                
                if( tot > THRESHOLD_MIN ) {
                    if( tot < THRESHOLD_MAX ) {
                        mask = 1 ;
                        for( s=0 ; s<NSYN ; s++ ) {
                            if( (cnta & mask) != 0 && weight[mta][n][s] ) {
                                if( abs(weight[mta][n][s] +mtadj) < THRESHOLD ) {
                                    weight[mta][n][s] += mtadj ;
                                    if( abs(weight[mta][n][s]) < MIN_WEIGHT )
                                        {weight[mta][n][s] = 0 ; nbcn[mta][n]-- ;}
                                }
                            }
                            mask <<=1 ;
                        }
                    }
                    tot = 0 ; cnta = 0 ;
                }
                tot >>=1 ;
            }
        }
        mtact_learn[arg->trhdNo] = 0 ;
    }
}

void learn( int sample , int a , int adj )
{
    int trhdNo,done,n ;

    mta = a ; mtadj = adj ; mtsample = sample ;
    for( trhdNo=0 ; trhdNo<CPU_THREAD ; trhdNo++ ) mtact_learn[trhdNo] = 1 ;

    done = 0 ;
    while( done != CPU_THREAD ) {
        nanosleep( &mtwait , NULL ) ;
        done = 0 ; 
        for( trhdNo=0 ; trhdNo<CPU_THREAD ; trhdNo++ ) 
            done += 1-mtact_learn[trhdNo] ;
    }
    nbc[mta] = 0 ; for( n=0 ; n<NSIZE ; n++ ) nbc[mta] += nbcn[mta][n] ;
}

void learn_mt_init()
{
    int start,trhdNo,sssize ;
    pthread_t thrds[CPU_THREAD] ;
    
    sssize = (NSIZE +CPU_THREAD-1) / CPU_THREAD ;
    start = 0 ; trhdNo = 0 ;

    while( start < NSIZE ) {
        thrd_args[ trhdNo ].from = start ;
        thrd_args[ trhdNo ].to = start+sssize ;
        if( start+sssize >= NSIZE ) thrd_args[ trhdNo ].to = NSIZE ;
        thrd_args[ trhdNo ].trhdNo = trhdNo ;
        pthread_create( &thrds[trhdNo], NULL, learn_mt_one, 
            (void *)&thrd_args[ trhdNo ] ) ;
        start += sssize ; trhdNo++ ;
    }
}

void batch()
{
    int b,bu,a,l,sample,is,isnot,g,tot,led,prevStep ;
    static int sel[60000] ;
    float d ;
    
    prevStep = 0 ; nbtested = 0 ; bu = 0 ; b = 0 ;
    while( 1 == 1 )
    {    // batch selection
        bcopy( zero , sel , 60000*sizeof(int) ) ; // no duplicate
        for( l=0 ; l<listSize[2] ; l++ )
        {
            list[2][l] = rnd(60000) ;
            while( sel[ list[2][l] ] ) list[2][l] = rnd(60000) ;
            sel[ list[2][l] ] = 1 ;
        }

        listNb = 2 ; test() ; nbtested++ ;
            
        for( l=0 ; l<listSize[2] ; l++ ) {
            sample = list[2][l] ; is = type[0][sample] ;
            
            if( (float)(err[sample]) >= quantP || nbtested>b /*lim.@start*/) {
                connect( nbNew , sample , is , THRESHOLD/divNew , g ) ;
                learn( sample , is , adjp ) ;
                toreload[is] = 1 ; nblearn++ ; b++ ;
                if( lossw && nblearn%losswNb == 0 ) loss( b ) ;
            }
            
            if( bu > 3*b/2 ) continue ; // limiter for faster startup.

            tot = 0  ; for( a=0 ; a<NTYPE ; a++ ) tot += nsp[2][sample][ a ] ;
            for( isnot=0 ; isnot<NTYPE ; isnot++ ) {
              if( isnot != is ) {
                d = (float)(nsp[2][sample][isnot]-(tot-nsp[2][sample][isnot])/9) ;
                if( quantUpN != 0.0 && d >= quantN )
                {
                    connect( nbNew , sample , isnot , -THRESHOLD/divNew , g );
                    learn( sample , isnot , adjn ) ;
                    toreload[isnot] = 1 ; bu++ ;
                }
              }
            }
            if( /*b/briefStep >prevStep*/ b%briefStep == 0 && b != prevStep ) {
                printf("testing %d: ",nblearn) ;
                listNb = 0 ; printf("0: %2.3f  ", (float)test() /600.0) ;
                listNb = 1 ; printf("1: %2.2f\n", (float)test() /100.0) ;
                prevStep = b /*/briefStep*/ ;
            }
        }
    }
}

int main()
{
    srand( 1000 ) ;    // time(NULL) for random
    setbuf(stdout, NULL);    // unbuffering stdout
    mtwait.tv_sec = 0 ; mtwait.tv_nsec = 10 ;
    // Download the files at www.yann.lecun.com/​​
    readset( 0 , ”./​train-labels-idx1-ubyte” , ”./​train-images-idx3-ubyte” ) ;
    readset( 1 , ”./​t10k-labels-idx1-ubyte” , ”./​t10k-images-idx3-ubyte” ) ;
    init() ;
#ifdef GPU
    init_gpu() ;
#else
    test_mt_init() ;
#endif
    learn_mt_init() ;
    
    //​ params
    briefStep = 20000 ;    //​ increase to speed up
    listSize[2] = 200 ;
    quantP = 0.0 ; quantDivP = 100 ; quantUpP = 19.0 ;
    quantN = 0.0 ; quantDivN = 900 ; quantUpN = 150.0 ; //​ 98.9 : quantDivP=100
    adjp = 500 ; adjn = −500 ;		//​ 98.9 : 1000 /​ −1000
    divNew = 10 ; nbNew = 10 ;    	//​ 98.9 : 10 /​ 20
    lossw = 20 ; losswNb = 100 ;    //​ 98.9 : 10 /​ 100
    
    batch() ;
}