The randomForest package in R by A. Liaw is a port of the original code, being a mix of (translated) C code, some remaining Fortran code, and R wrapper code. To decide the best overall split across break points and across the mtry candidate variables, the code uses a scoring function similar to Gini gain:
$$\text{GiniGain}(N,X) = \text{Gini}(N) - \frac{|N_1|}{|N|}\,\text{Gini}(N_1) - \frac{|N_2|}{|N|}\,\text{Gini}(N_2)$$

where $X$ is the candidate split variable, $N$ is the node being split, $N_1$ and $N_2$ are the two daughter nodes produced by the split, and $|\cdot|$ denotes node size (number of samples).

$$\text{Gini}(N) = 1 - \sum_{k=1}^{K} p_k^2$$

where $K$ is the number of target classes and $p_k$ is the prevalence of class $k$ in the node.

$\text{Gini}(N)$, the impurity of the parent node, is the same for every candidate split within that node, so only the weighted daughter impurities have to be compared. For daughter node 2 (and analogously for node 1):

$$\frac{|N_2|}{|N|}\,\text{Gini}(N_2) \propto |N_2|\,\text{Gini}(N_2) = |N_2|\left(1 - \sum_{k=1}^{K} p_{2,k}^2\right) = |N_2| - |N_2|\sum_{k=1}^{K} \frac{nclass_{2,k}^2}{|N_2|^2}$$
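For concreteness, here is a minimal sketch (in C, but not the package's actual classTree.c code) of how the Gini gain above can be evaluated from raw per-class counts. The names `gini`, `gini_gain`, `nclass1`, and `nclass2` are illustrative, not taken from the real source:

```c
/* Gini impurity of a node from its per-class counts and size:
   Gini(N) = 1 - sum_k (nclass_k / |N|)^2 */
static double gini(const int *nclass, int K, int n)
{
    if (n == 0) return 0.0;
    double sumsq = 0.0;
    for (int k = 0; k < K; k++)
        sumsq += (double)nclass[k] * nclass[k];
    return 1.0 - sumsq / ((double)n * n);
}

/* GiniGain(N, X) = Gini(N) - |N1|/|N| Gini(N1) - |N2|/|N| Gini(N2);
   the parent's class counts are the element-wise sum of the daughters'. */
double gini_gain(const int *nclass1, const int *nclass2, int K)
{
    int n1 = 0, n2 = 0;
    for (int k = 0; k < K; k++) { n1 += nclass1[k]; n2 += nclass2[k]; }
    int n = n1 + n2;
    if (n == 0) return 0.0;

    double sumsq = 0.0;
    for (int k = 0; k < K; k++) {
        double c = (double)nclass1[k] + nclass2[k];
        sumsq += c * c;
    }
    double gini_parent = 1.0 - sumsq / ((double)n * n);

    return gini_parent
         - ((double)n1 / n) * gini(nclass1, K, n1)
         - ((double)n2 / n) * gini(nclass2, K, n2);
}
```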
where $nclass_{2,k}$ is the count of target class $k$ in daughter node 2 ($nclass_{1,k}$ analogously for daughter node 1). Notice that $|N_2|$ appears in both the numerator and the denominator.
Removing the trivial constant $1-$ from the equation (it only contributes the fixed total $|N_1| + |N_2| = |N|$), the best-split decision amounts to maximizing the node-size-weighted sum of squared class prevalences:
$$\text{score} = |N_1|\sum_{k=1}^{K} p_{1,k}^2 + |N_2|\sum_{k=1}^{K} p_{2,k}^2 = |N_1|\sum_{k=1}^{K} \frac{nclass_{1,k}^2}{|N_1|^2} + |N_2|\sum_{k=1}^{K} \frac{nclass_{2,k}^2}{|N_2|^2}$$

$$= \sum_{k=1}^{K} nclass_{1,k}^2\,\frac{1}{|N_1|} + \sum_{k=1}^{K} nclass_{2,k}^2\,\frac{1}{|N_2|}$$

$$= \text{numerator}_1/\text{denominator}_1 + \text{numerator}_2/\text{denominator}_2$$
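A small sketch of that simplified criterion (again not the package's code): each daughter contributes the sum of its squared class counts divided by its size, mirroring the numerator/denominator decomposition above. The function name `split_score` and its arguments are illustrative:

```c
/* Simplified split score (constant 1- and the common 1/|N| factor dropped):
   score = sum_k nclass1_k^2 / |N1| + sum_k nclass2_k^2 / |N2|.
   Higher is better. */
double split_score(const int *nclass1, int n1,
                   const int *nclass2, int n2, int K)
{
    double num1 = 0.0, num2 = 0.0;   /* numerators: sums of squared class counts */
    for (int k = 0; k < K; k++) {
        num1 += (double)nclass1[k] * nclass1[k];
        num2 += (double)nclass2[k] * nclass2[k];
    }
    double score = 0.0;
    if (n1 > 0) score += num1 / n1;  /* numerator1 / denominator1 */
    if (n2 > 0) score += num2 / n2;  /* numerator2 / denominator2 */
    return score;
}
```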
The implementation also allows class-wise up-/down-weighting of samples. Just as important, the way this modified Gini gain is updated makes moving a single sample from one node to the other very efficient: its contribution is subtracted from the numerator/denominator of one node and added to those of the other (a sketch of this incremental update follows below).
I wrote a prototype RF some months ago, naively recomputing the Gini gain from scratch for every break point, and that was slower :)
If several splits score equally best, a random winner is picked.
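The sketch below (mine, not the package's C code) illustrates both points for one candidate variable: sweeping the break points of samples sorted by that variable, moving one sample at a time from the right daughter to the left, updating the score incrementally, and breaking ties at random. It ignores details such as class weights and skipping tied variable values; `cls`, `best_break_point`, and the other names are hypothetical:

```c
#include <stdlib.h>

/* Sweep the break points of one sorted candidate variable, maintaining the
   split score incrementally. Moving sample b from the right daughter to the
   left changes a single class count c -> c+1, so each numerator is updated
   in O(1) via (c+1)^2 - c^2 = 2c + 1. Ties are broken uniformly at random.
   cls[i] in [0, K) is the class of the i-th sample, sorted by the variable. */
int best_break_point(const int *cls, int n, int K)
{
    int *left  = calloc(K, sizeof *left);   /* class counts, left daughter  */
    int *right = calloc(K, sizeof *right);  /* class counts, right daughter */
    double num_left = 0.0, num_right = 0.0; /* sums of squared class counts */

    for (int i = 0; i < n; i++) right[cls[i]]++;
    for (int k = 0; k < K; k++) num_right += (double)right[k] * right[k];

    double best = -1.0;
    int best_bp = -1, n_ties = 0;

    /* break point b: samples 0..b go left, b+1..n-1 go right */
    for (int b = 0; b < n - 1; b++) {
        int k = cls[b];
        num_left  += 2.0 * left[k]  + 1.0;  /* (c+1)^2 - c^2, old left count  */
        num_right -= 2.0 * right[k] - 1.0;  /* c^2 - (c-1)^2, old right count */
        left[k]++; right[k]--;

        double score = num_left / (b + 1) + num_right / (n - b - 1);
        if (score > best) {
            best = score; best_bp = b; n_ties = 1;
        } else if (score == best) {
            /* reservoir-style random tie-breaking among equally good splits */
            if (rand() % ++n_ties == 0) best_bp = b;
        }
    }
    free(left); free(right);
    return best_bp;  /* index of the last sample sent to the left daughter */
}
```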
This answer is based on inspecting the source file "randomForest.x.x.tar.gz/src/classTree.c", lines 209-250.