We sought to utilize advances in computer vision (CV) and deep learning (DL) and to move from single concept cells to the level of neuronal populations for identifying characters1–7. Nine neurosurgical patients (Supplementary Table 1) watched a 42 min movie (first episode, season six of the “24” TV series). Following the movie viewing, participants were tested for recognition memory: they were shown multiple short clips of targets (taken from the same episode) and foils (taken from another episode of the same TV series) and were instructed to mark whether they had seen each clip (Fig. 1a; Methods). These participants were implanted with multiple depth electrodes as part of the clinical procedure for seizure monitoring and possible resection of the epileptogenic tissue. We recorded single unit activity from 385 neurons from multiple brain regions (Methods; Supplementary Tables 2, 3)3,8,9,10.
Schematic of the task and the pipeline for semi-supervised character extraction. (a.) The task consisted of viewing an episode of the 24 TV series followed by a recognition memory test. During the memory test, participants were shown short clips and were asked whether they had previously seen the clip or not. (b.) A brief overview of the algorithm used for character extraction. For a more detailed version, see Supplementary Fig. 1. (c.) Example outputs of the character extraction algorithm at different steps: Left) output of stage 1 (extracting humans in each frame); Right) output of stage 4 (sample different appearances for each of the four main characters C.1 through C.4). (d.) To test the performance of our character extraction algorithm, we used character labels that were manually created in an independent study7. Shown are the normalized confusion matrices of individual characters (right) as well as the overall confusion matrix (left). Yp and Np correspond to predicted yes and predicted no, respectively. Note the darker colors along the diagonal, showing large percentages of true positives and true negatives, indicative of high performance.
We first examined whether it is possible to decode the presence or absence of individual characters throughout the movie from the neuronal population responses despite the highly variable physical appearance and context. As a first step towards generating training and test sets for the decoding task, we developed a semi-supervised algorithm that labeled the presence or absence of nine important characters (defined as characters that appeared for at least 1% of the episode) in each frame (Fig. 1b, c; Supplementary Fig. 1; Methods). Briefly, the associated pipeline involved: (1) extraction of humans in each frame using a pre-trained YOLO-V3 network11, (2) spatio-temporal tracking of detected humans to form clusters of image crops belonging to the same character, (3) grouping of these spatio-temporal image clusters into nine important character identities, based on facial features; (this is the only part that required manual intervention), and (4) training a ten-label Convolutional Neural Network (CNN) with the automatically learned examples to identify the presence or absence of each of the nine characters in each frame (the tenth label corresponded to “other”). Of the nine character identities created by our algorithm, we picked the four most prominent characters to be decoded using neural data; henceforth, to be referred to as C.1, C.2, C.3, and C.4 (This allowed us to have sufficient data points for training and testing in the later steps; see Methods; Supplementary Fig. 2; Supplementary Table 4). Prior work7 had segmented the same movie into a set of shots (or cuts, defined as consecutive frames between sharp transitions), and manually labeled each shot (as opposed to the individual frames in our case) with the names of the characters in it. By aggregating frame level labels over each shot, we benchmarked the performance of the automated method against the manual labels (results are shown in Fig. 1d).
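As an illustration of the benchmarking step, the sketch below aggregates frame-level labels into shot-level labels and compares them with manual shot labels through a row-normalized confusion matrix. The aggregation rule (a character counts as present when it appears in at least half of a shot's frames), the function names, and the toy data are assumptions made for illustration; the rule actually used is not specified here.

```python
from collections import Counter

def aggregate_to_shots(frame_labels, shot_bounds, min_fraction=0.5):
    """Collapse per-frame 0/1 labels into one label per shot.

    Assumed rule (not from the paper): a character is 'present' in a shot
    if it is present in at least `min_fraction` of the shot's frames.
    `shot_bounds` holds (start, end) frame indices, with `end` exclusive.
    """
    shot_labels = []
    for start, end in shot_bounds:
        frames = frame_labels[start:end]
        shot_labels.append(sum(frames) / len(frames) >= min_fraction)
    return shot_labels

def normalized_confusion(true, pred):
    """Row-normalized 2x2 matrix: rows = true (Y, N), columns = predicted (Yp, Np)."""
    counts = Counter(zip(true, pred))
    matrix = []
    for t in (True, False):
        row_total = counts[(t, True)] + counts[(t, False)]
        matrix.append([counts[(t, p)] / row_total if row_total else 0.0
                       for p in (True, False)])
    return matrix

# Toy data: 12 frames split into 4 shots, for a single character.
frame_labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]
shot_bounds = [(0, 3), (3, 7), (7, 10), (10, 12)]
manual = [True, False, True, True]          # hypothetical manual shot labels

auto = aggregate_to_shots(frame_labels, shot_bounds)
cm = normalized_confusion(manual, auto)
```

Row normalization makes the diagonal entries directly readable as the true positive rate and true negative rate, which is how the confusion matrices in Fig. 1d are described.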
Having established continuous (as opposed to a cut-level human annotation) labels for the visual presence of characters in each frame (using our semi-supervised method), we asked whether it is possible to find neural footprints for each of the four main characters in our electrophysiological data for each participant in our study. A footprint of a character is a discriminative pattern in neural data responding to the presence of the character, such that the pattern appears if and only if the character is present, and hence a decoder can be trained. Furthermore, these footprints are participant-specific, given that each participant had a unique set of recording sites. To build such a decoder, for any given frame with a character in it, we created a candidate footprint feature vector comprising the firing rates of all active neurons from all regions during a two-second interval around the frame (one second before and one second after for each target frame; see Methods). The exact number of neurons and regions for each participant can be found in Supplementary Tables 2, 3.
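The construction of a candidate footprint can be sketched as follows. This is a minimal example, assuming spike times in seconds per neuron and 60 time bins per two-second window (the number of time-steps mentioned in the Fig. 2a caption); the function name, the binning scheme, and the toy spike trains are hypothetical.

```python
def window_features(spike_times, frame_time, half_width=1.0, n_bins=60):
    """Binned firing rates (Hz) in [frame_time - half_width, frame_time + half_width).

    spike_times: one list of spike times (s) per neuron.
    Returns a list of n_bins rows, each with one rate per neuron.
    """
    bin_width = 2 * half_width / n_bins
    start = frame_time - half_width
    features = [[0.0] * len(spike_times) for _ in range(n_bins)]
    for j, train in enumerate(spike_times):
        for t in train:
            idx = int((t - start) // bin_width)
            if 0 <= idx < n_bins:            # ignore spikes outside the window
                features[idx][j] += 1.0 / bin_width
    return features

# Toy example: two neurons, a frame at t = 10 s, four 0.5 s bins for readability.
spike_times = [[9.1, 9.2, 10.6], [10.9, 12.5]]
feats = window_features(spike_times, frame_time=10.0, n_bins=4)
```

Each such matrix (time bins by neurons) would then serve as one input sequence to the decoder for the corresponding frame.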
To implement the decoder, we first used and optimized a two-layer Long Short Term Memory (LSTM) network12 (Fig. 2a, Methods; Supplementary Table 5). The final layer outputs the probabilities of the four main characters in each frame, which during evaluation were further binarized into presence (“yes” label) or absence (“no” label) predictions (Fig. 2a). We used a 5-fold cross-validation method, with 70% of the data used for training, 10% for validation, and 20% for testing (frames were randomized and were independent in each set). It is worth noting that the true data labels were highly unbalanced: each character was visually present in at most around 20% of the frames (Supplementary Table 4). As such, without using higher weights in the loss function for “yes” labels, one would expect the model to converge to a misleadingly high accuracy of about 80% while detecting the character in 0% of the frames in which it appears. However, we obtained good decoding performance both in terms of accuracy and character detection, thus indicating significant latent character information in the neural data (Methods).
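The imbalance argument can be made concrete with a small sketch. It scores a trivial decoder that always predicts “no” on data with 20% “yes” labels and computes one possible weight for the “yes” class. The inverse-frequency weighting shown here is an assumption for illustration, not necessarily the weighting described in Methods.

```python
def binary_metrics(true, pred):
    """Accuracy, recall, precision, and F1 for boolean labels and predictions."""
    tp = sum(t and p for t, p in zip(true, pred))
    fp = sum((not t) and p for t, p in zip(true, pred))
    fn = sum(t and (not p) for t, p in zip(true, pred))
    tn = sum((not t) and (not p) for t, p in zip(true, pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(true), "recall": recall,
            "precision": precision, "f1": f1}

def positive_class_weight(true):
    """Hypothetical inverse-frequency weight for 'yes' labels: n_no / n_yes."""
    n_yes = sum(true)
    return (len(true) - n_yes) / n_yes

# A character present in 20% of frames, and a decoder that always says "no".
true = [True] * 20 + [False] * 80
always_no = [False] * 100

m = binary_metrics(true, always_no)        # accuracy 0.8, recall 0.0, F1 0.0
w_yes = positive_class_weight(true)        # 4.0
```

The trivial decoder reaches 80% accuracy with zero recall and zero F1, which is why accuracy alone is not informative here and why F1-scores are reported alongside it.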
Decoding visual presence of the four main movie characters using neural recordings as inputs. (a) The structure of the LSTM network used for classification. Here, input data was the firing sequence (colormap; brighter shades indicate higher firing) of all neurons (x-axis) from a participant within a two-second window around each frame (y-axis: time; in this case 6 seconds are shown with data 1-3 being example representatives of 2 seconds around each frame with 60 time-steps). Firing rate maps were sequentially passed through two LSTM layers followed by two fully connected (FC) layers to output a probability distribution over the four main characters in that frame of the movie. Note that the predictions occur on a frame by frame basis. (b) Frame-by-Frame comparison of character labels generated by CV and neuronal vision (LSTM classifier): The labels generated by the computer vision algorithm (green vertical lines; each row is a different character) and the LSTM (blue dots: true positive, TP; red dots: false positive, FP) for each movie frame are plotted as a function of time. The significant overlap between the two labels (i.e., green lines and blue circles, large number of true positives) illustrates the goodness of the decoding algorithm. (c) In an example participant, the normalized confusion matrices for the binary classification task for all the four characters are shown. Each row indicates the true labels (Y: character present in the frame; N: character not in the frame) and each column indicates the predicted labels by the classifier. The large numbers on the diagonals (high true positive rate (TPR) and true negative rate (TNR)) of all the four matrices shows that the LSTMs achieve high accuracy in decoding all the four characters. (d) The distribution of the entries of the confusion matrix over all participants is shown as a bar plot (mean) with error bars (std) for all four characters. 
The high mean and low standard deviation for the TPR and TNR values in all the four matrices show that the LSTM achieves high accuracy in decoding all the four characters across participants. (e-f.) Accuracy (e) and F1-scores (f) for decoding each character are shown with each colored dot indicating different participants (Pt1 through Pt9). The consistently high accuracy and F1-scores across participants indicate that the LSTM generalizes well in this decoding task. The lines and shaded areas (mean ± STD) indicate the performance of the chance model (obtained from shuffling labels) across all participants. Note that the chance level for accuracy is at 80% due to the unbalanced nature of the data. For instance, given that a character is present only in 20% of the frames, predicting a “N” for all frames would yield 80% correct predictions.
For each participant, the performance of the decoder was first visualized by comparing the frame level predictions for each character against the corresponding true labels (Fig. 2b), and was further quantified by a normalized confusion matrix (Fig. 2c). Performance of the LSTM decoder, as quantified by the F1-scores—a measure that is more appropriate for unbalanced datasets—was on average 7-times better than that of a distribution-based decoder (such as Naive Bayes (Methods); Supplementary Table 6). This suggests the presence of strong and unique discriminative character footprints in the neural data. The distribution of the entries of the normalized confusion matrix (Fig. 2d), the plots of the Accuracy (Fig. 2e) and F1-scores (Fig. 2f) across all nine participants, as well as the table of Recall, Precision, F1-scores, and Accuracy (Supplementary Table 7) showed consistently good results. Lastly, to further ensure that our results could not arise by chance, we performed a shuffling procedure in which the character labels were randomized with respect to the neural data and the participant-specific models were retrained and re-evaluated. Here, too, the performance of the true model was far above the performance of the chance model (Figs. 2e,f; shaded region). Next, we addressed the effect of faulty labels in the output of the semi-supervised computer vision algorithm on the performance of the neural decoder (“neuronal vision”). We used the cut-level human annotations as the ground truth. Although the neural decoder was trained and tested using the computer vision (CV) labels, which albeit close to the ground truth (manually-labeled data) contained a small number of faulty labels (Fig. 1c), we found that in the case of recall (i.e. true positive rate), neural vision outperformed the computer vision results (p = 5.05 × 10−3, Signrank test).
Since Neural Network (NN) architectures, such as the LSTM, have high representational capabilities, several different converged models (corresponding to different minima for the same training data) could give comparable end-to-end performance but could produce dramatically different results when the models are used to determine functional properties of the underlying physical systems. In our case, for example, we intended to use the learned models to determine the regions and subregions carrying the most relevant information in the decoding of different characters. Therefore, we used an entirely different NN architecture, namely a convolutional neural network (CNN) model, where the time series training data around each frame was converted into an image (Methods; Supplementary Table 8). The exact same tasks were replicated for both LSTM and CNN networks. Indeed, the CNN model reached comparably high performance to that of the LSTM model in decoding characters (Supplementary Fig. 3; Supplementary Table 6) and yielded similar results to the LSTM pipeline in the subsequent analyses detailed below. The consistency of results between the two NN models (LSTM and CNN) is critical, especially when assessing the importance of different brain regions in the decoding process, since it ensures that the results are not merely an artifact of model optimizations. Lastly, the observation that both NN models performed significantly better than the other classifiers we tried (e.g., linear SVM; Methods) suggests that the nonlinear dynamics between the input features are a critical aspect of the decoding process that is not captured by traditional machine learning methods (Supplementary Table 6). It is possible, however, that if the recorded neurons were sampled from the relevant regions (e.g., face areas) or the degree of invariance in the presented stimuli was reduced (e.g., repeated still images of the characters under a few conditions), the traditional machine learning methods would perform satisfactorily as well.
Thus far, we used the activity of all of the recorded units within each participant as the input to the NN models. Next, we used a knockout analysis to determine the brain regions that were more critical than others in the decoding process. This knockout analysis is analogous to the analysis tool named Occlusion Sensitivity13 for inspecting NN image classifiers. Specifically, we evaluated the performance of our model (that was trained on units from all regions; base results) on data in which the activity of units from individual regions was eliminated one at a time (region knockout results). We used the change in the Kullback–Leibler divergence (KLD) loss due to the knockout, normalized by the number of neurons, as a proxy for how much worse (or better) the model performs without units recorded from a specific region (Methods). The KLD loss was used because it is a more granular metric than others, such as accuracy, for investigating the model’s response to such knockout perturbations. We found that knockouts of different regions led to different changes in KLD loss, and we determined important regions to be those which, when knocked out, led to higher normalized KLD loss (Fig. 3a). Of the eleven regions, knocking out five of them resulted in the most notable (normalized) losses in decoding performance (occipital, entorhinal cortex, parahippocampal, anterior cingulate, and superior temporal) (Fig. 3b). For each participant, we additionally verified this finding by re-training two independent NNs on neural data from important and less important regions and by comparing their performance. The two separate models, one trained only using the units from regions that were deemed important, and another one trained only using the remainder of the units (Methods), showed a significant difference in their decoding performance (p = 0.01, Wilcoxon ranksum test comparing the F1-scores of the two re-trained models).
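The region knockout measure can be sketched as below. This is a minimal illustration under stated assumptions: a toy logistic readout stands in for the trained NN, “eliminating” activity is implemented as setting the firing rates of the knocked-out units to zero, and the labels are hard 0/1 labels, for which the KL divergence reduces to the negative log-likelihood. All names and data are hypothetical.

```python
import math

def kld_loss(labels, probs, eps=1e-12):
    """Mean KL divergence between Bernoulli labels and predicted probabilities."""
    total = 0.0
    for y, q in zip(labels, probs):
        q = min(max(q, eps), 1 - eps)        # avoid log(0)
        for p, q_k in ((y, q), (1 - y, 1 - q)):
            if p > 0:                        # 0 * log(0) is taken as 0
                total += p * math.log(p / q_k)
    return total / len(labels)

def toy_decoder(weights, bias):
    """Stand-in for the trained network: logistic readout of firing rates."""
    def predict(x):
        z = bias + sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))
    return predict

def region_knockout(predict, X, labels, regions):
    """Change in KLD loss per region, normalized by the region's unit count."""
    base = kld_loss(labels, [predict(x) for x in X])
    deltas = {}
    for name, units in regions.items():
        knocked = [[0.0 if j in units else v for j, v in enumerate(x)] for x in X]
        loss = kld_loss(labels, [predict(x) for x in knocked])
        deltas[name] = (loss - base) / len(units)
    return deltas

# Toy data: units 0 and 1 (region "A") carry the signal, unit 2 ("B") does not.
predict = toy_decoder(weights=[2.0, 2.0, 0.0], bias=-2.0)
X = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
labels = [1, 0, 1, 0]
regions = {"A": {0, 1}, "B": {2}}

deltas = region_knockout(predict, X, labels, regions)
```

The model is not retrained for the knockout; only its inputs are perturbed, which mirrors the occlusion-style logic described above. Regions with a larger normalized increase in loss would be ranked as more important.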
Identification of important regions in decoding characters: “The Whole Is Greater Than the Sum of its Parts.” (a) The change of KLD loss for each character (row) after knocking out a given region (column) for one participant is shown (Region Knockout). The value is normalized by the number of neurons in that region and demonstrates how the model performance deteriorated when excluding the units recorded from that region. Important regions are those with higher KLD loss values. L and R correspond to the left and right hemisphere, respectively. A: Amygdala, AC: Anterior Cingulate, EC: Entorhinal Cortex, MH: Middle Hippocampus, VMPFC: Ventromedial Prefrontal Cortex. (b) The changes in KLD loss after knocking out regions are shown across participants. Different colored dots correspond to the changes in KLD loss for different characters. Bars indicate the median value of the change in KLD loss after region knockout. The following regions resulted in the most notable losses in decoding performance: anterior cingulate (36.11, [20.82, 53.78]%), entorhinal cortex (42.50, [27.04, 59.11]%), occipital (65.00, [40.78, 84.61]%), parahippocampal (37.50, [8.52, 75.51]%), and superior temporal (33.33, [9.92, 65.11]%). Reported are the percentages of losses above 0.5 (as well as the binomial fit confidence intervals) (N = 9 participants). (c) The change in KLD loss for each character (row) after knocking out a given electrode (column) at a time is shown (Electrode Knockout) for an example participant (same as in a). Similar to the region knockout results in (a), the loss value is normalized by the number of units recorded on each electrode. (d) The sum of the changes in KLD loss following electrode knockout (all electrodes within a region) was subtracted from the change in KLD loss following region knockout. Shown are these values for the four different characters (rows) from an example participant (same as in a and c). 
Positive values indicate that knocking out a whole region deteriorates the model performance to a greater extent. (e). When considering all regions from all participants, in most regions, the region knockout loss was greater than the sum of electrode knockout loss (each column, and its associated colormap, is the distribution of this measure and the red horizontal line indicates the median of the distribution for those that were significantly different from zero) as quantified by Wilcoxon signed-rank tests (*:p < 0.05; **: p < 0.01; ***: p < 0.001).
We asked whether the co-activation pattern among the neurons within each region was contributing to the decoding performance. We defined the incremental information content of a neuron as the increase of the KLD loss on removing the neuron’s activity while preserving the rest of the system. We observed that the information content of a set of neurons, that is, the increase in the KLD loss by removing all the neurons in the set, is larger than the one obtained by summing up the incremental information contents of the individual neurons. This was shown by applying the same knockout analysis, where we knocked out the activity of all the neurons recorded on individual electrodes (microwires; 8 per region; electrode knockout results) one at a time and evaluated the KLD loss (Fig. 3c). Our analysis showed that the resulting increase in the KLD loss from region knockout (i.e., when all the units in a given region were knocked out together) was greater compared to when the increases in KLD losses from electrode knockout within a region were added together (Fig. 3d) in most regions (Fig. 3e; P < 0.05 for eight out of eleven regions, Wilcoxon signed-rank test). These results were replicated using the CNN network as well (Supplementary Fig. 4). This finding of “the whole is greater than the sum of its parts” may indicate that the neurons’ dynamics and inter-relations across a region may also contribute to the decoding performance.
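The comparison between region knockout and the sum of electrode knockouts can be sketched as follows. As before, a toy logistic readout stands in for the trained NN, knockout is implemented as zeroing inputs, and the loss is the negative log-likelihood (equal to the KLD for hard 0/1 labels). The per-unit normalization used in the figures is omitted for brevity, and every name and value is hypothetical.

```python
import math

def make_decoder(weights, bias):
    """Stand-in for the trained network: logistic readout of firing rates."""
    def predict(x):
        z = bias + sum(w * v for w, v in zip(weights, x))
        return 1.0 / (1.0 + math.exp(-z))
    return predict

def mean_loss(predict, X, labels):
    """Mean negative log-likelihood of the 0/1 labels under the decoder."""
    total = 0.0
    for x, y in zip(X, labels):
        p = predict(x)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(X)

def knockout_delta(predict, X, labels, units):
    """Increase in loss after silencing the given unit indices."""
    silenced = [[0.0 if j in units else v for j, v in enumerate(x)] for x in X]
    return mean_loss(predict, silenced, labels) - mean_loss(predict, X, labels)

def region_minus_electrodes(predict, X, labels, electrodes):
    """Region knockout effect minus the summed single-electrode effects."""
    region_units = set().union(*electrodes)
    region_delta = knockout_delta(predict, X, labels, region_units)
    electrode_sum = sum(knockout_delta(predict, X, labels, e) for e in electrodes)
    return region_delta - electrode_sum

# Toy region with two electrodes, one unit each, carrying redundant signal.
predict = make_decoder(weights=[3.0, 3.0], bias=-2.0)
X = [[1.0, 1.0], [0.0, 0.0]]
labels = [1, 0]
electrodes = [{0}, {1}]

synergy = region_minus_electrodes(predict, X, labels, electrodes)
```

In this toy case the value is positive: silencing either electrode alone is partly compensated by the other, so the joint knockout hurts more than the two single knockouts combined. This reproduces the sign of the effect only; it does not show what produces it in the recorded data.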
After quantifying the model performance during movie viewing, we examined how the model fared during the memory test following the movie (Fig. 1a; Methods). It should be borne in mind that because of the nature of the recognition task, in which participants have to decide whether they have seen the clip or not, participants would most likely remember parts of the movie plot beyond those displayed during the clips. Thus, any well-trained decoder may predict the presence of characters not necessarily visually present in the clip itself. As such, when any such decoder is evaluated by whether it predicts the characters in the clips, it might produce more false positives (FPs) than during movie viewing, therefore lowering the accuracy. Indeed, as reported in the following, we observed that our decoder led to an increased number of FPs and, as expected, the accuracy of our model was lower during the memory task compared with the movie (~ 67% compared to ~ 95%). We discovered, however, that the accuracy of the model was positively correlated with both the percentage of time the character was present in the clip and the size of the character in the frames (Fig. 4a), which might be expected given that both are measures of how prominent the character is.
Properties of the NN model trained for decoding characters in the movie during the memory test. (a) The accuracy of the NN model (trained during movie viewing) in decoding the visual presence of the characters in the clips during the memory test improved as a function of both the “size” of the character in the clips (defined as the percentage of the pixels that the character occupied in a given frame; reference: half frame) as well as the prevalence of the character in the clip (defined as the percentage of the clip time during which the character was present). Shown are the mean ± SEM of the model accuracy at different thresholds across participants. (b) Model activation as a function of time (averaged across five folds) was separated for the clips during which the character was present (character in; left) and all other clips without the character (character out; right; a random subset of the clips is shown in the panel). Note that during the clips that contained the character the model activation appeared higher. The example shown is from participant 6, activation for character 1. (c) Model activation as a function of time (with respect to the clip onset) was divided into a 2 × 2 matrix: clips with/without the characters × participants’ subjective memory of the clip (yes-seen/no-have not seen). Shown are the mean ± SEM of the model activation for each of the four groups of clips. (d) To test whether both the presence of characters and participants’ subjective memory influenced the model activation at the population level (for all participants and clips), we used a GLM method. We modeled the overall activation (during a two-second interval after clip onset) as a function of (1) whether the character was in the clip or not (estimated coeff. = 0.39, p = 8.02 × 10−6); and (2) whether the participant marked the clip as previously seen or not (coeff. = 0.24, p = 0.001). 
Both were significant factors in the model activation, and the height of the bars and error bars correspond to the mean overall activation and the standard error across clips (from all participants), respectively. (e) Left) Character associations as quantified by the conditional probabilities of their co-occurrences during the movie. For example, row 1, column 4 represents P(character 4 | character 1). Right) Conditional probabilities of model activations for the characters. The structure follows the same pattern as in (left). (f) Conditional probabilities of model activations during clip viewing (e-Right) were significantly correlated with character associations (e-Left) in an example participant (Spearman correlation r = 0.67, p = 4.73 × 10−3). Please note that the values from the diagonals are excluded in this analysis. (g) Conditional probabilities of the model activation for characters were higher during response time compared to clip viewing (p = 0.016; sign-rank test).
Our NN models output a confidence level for each character in the range [0,1] at any given time; this confidence level is henceforth referred to as model activation. So far, while reporting metrics such as accuracy and F1-score, we followed the standard technique of binarizing this model activation: 1 if it is greater than 0.5 and 0 otherwise. However, model activation can provide more granular information and was, therefore, used to analyze the decoder performance during the memory task. We evaluated the NN model activations for each character as a function of time and, specifically, with respect to the clip onset time (Fig. 4b). We noted that the model activations, in addition to the visual presence or absence of the characters, were also related to participants’ subjective memory. Model activation during clip viewing was highest when the characters were in the clip and the participants later marked the clip as “seen before” during the response time. Conversely, model activation was lowest when the character was not in the clip and the participant marked the clip as “not seen before” (Fig. 4c).
To quantify this at the population level, we used a Generalized Linear Model (GLM) where we modeled character activations as a function of (1) the presence or absence of the character in the clip; (2) whether the clip was marked as seen or not; and (3) whether the clip belonged to the target or not (Methods), all as main terms. We found that both the presence of the character as well as the participants’ response to whether they had seen the clip were significant contributing factors in model activation (Fig. 4d; Character in: estimated coeff. = 0.39, p = 8.02 × 10−6; Clip marked as seen: estimated coeff. = 0.24, p = 0.001; GLM estimates). The effect of participants’ subjective memory of the clips on the model activation during memory was surprising given that the NN model was only trained to distinguish the visual presence or absence of the characters. In a separate GLM, we included the interaction term between character presence and participants’ subjective memory as an additional variable, which (to our surprise) was not a significant predictor of NN model activations. Can the model activation be influenced by how often the characters were seen before? When we included the characters’ appearance frequency (a proxy for how often each character was seen during the episode) as a variable to the GLM model, we did not find it to be a significant factor. Although by the end of the episode the participants may be too familiar with the main characters, it would be interesting to investigate this factor within the initial phases of learning.
We then performed the knockout analysis (described in the previous sections) and evaluated the model activations after knocking out the activity of all MTL neurons. Once again, we modeled NN activations as a function of participants’ subjective memory (i.e., whether they marked the clips as seen or not) as well as the characters’ presence in the clip. Here, subjective memory was no longer a significant predictor of the activations, i.e., only the character presence remained a significant factor (GLM estimated coeff. = 0.50, p = 9.50 × 10−9), which is consistent with the role of the MTL in the formation of memories14,15,16.
It has been shown previously that the formation of new associations is reflected in the firing pattern of single neurons9. Accordingly, we hypothesized that if two characters, for example characters 1 and 4, became associated during the movie plot, the activation of the NN model for both characters should be high (regardless of whether each character is present or not). To test this, we used two measures: (a) Character associations within the plot: We computed the conditional probabilities of the characters, for example the probability of character 1 appearing given that character 4 appears in a given scene (a meaningful segmentation of the episode provided by an independent study7; Methods). Character associations, i.e., their conditional probabilities, are summarized in Fig. 4e (left); (b) NN model coactivations: To compute whether model activations for the different characters overlapped (e.g., when the model activation was high for character 1, there was a higher activation for character 4 as well, even in the absence of character 4), we used a similar approach to (a). We calculated the conditional probabilities for the characters predicted by our model in the clips (e.g., the probability of model activation being above 0.5 for character 1 given that the model activation is above 0.5 for character 4) (Methods). NN model coactivations, i.e., model conditional probabilities for a given participant, are shown in Fig. 4e (right). We found that the NN model coactivations for characters during clip viewing were positively correlated with character associations in the movie (example: Fig. 4f; for all participants: p < 0.05, Spearman correlation).
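The two measures and their comparison can be sketched as below. The matrix convention follows the Fig. 4e caption (row i, column j holds P(character j | character i)), activations are binarized at 0.5, and diagonal entries are excluded before the Spearman correlation is computed. The function names and the toy data for three characters are hypothetical.

```python
def conditional_probabilities(presence):
    """P[i][j] = P(character j present | character i present).

    presence: one tuple of 0/1 flags per scene (or time point), one flag per character.
    """
    n = len(presence[0])
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        n_i = sum(s[i] for s in presence)
        for j in range(n):
            both = sum(s[i] and s[j] for s in presence)
            P[i][j] = both / n_i if n_i else 0.0
    return P

def off_diagonal(M):
    """Flatten a square matrix, skipping the diagonal."""
    return [M[i][j] for i in range(len(M)) for j in range(len(M)) if i != j]

def ranks(values):
    """1-based ranks, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            r[k] = (pos + end) / 2 + 1
        pos = end + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# (a) Character presence per movie scene; (b) model activations per clip time point.
scenes = [(1, 1, 0), (1, 1, 0), (1, 0, 1), (0, 0, 1)]
activations = [(0.9, 0.8, 0.1), (0.7, 0.6, 0.2), (0.8, 0.3, 0.7),
               (0.2, 0.1, 0.9), (0.1, 0.6, 0.3)]
active = [tuple(int(a > 0.5) for a in frame) for frame in activations]

associations = conditional_probabilities(scenes)
coactivations = conditional_probabilities(active)
rho = spearman(off_diagonal(associations), off_diagonal(coactivations))
```

Because conditional probabilities are not symmetric, both the (i, j) and (j, i) entries enter the correlation as separate points.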
Lastly, we inspected the model activation during the response time (following the end of the clip and prior to the participants making a choice), when no visual information was provided to the participants, and noted that conditional probabilities of model activation for the different characters were significantly higher compared to those during clip viewing (p = 0.016; sign-rank test; Fig. 4g). This might suggest that higher-order associations among the characters, rather than the first-order associations observed during clip viewing, are invoked during the response time.