System for automatic detection and classification of cars in traffic

Objective: To develop a system for the automatic detection and classification of cars in traffic in the form of a device for autonomous, real-time car detection, license plate recognition, and car color, model, and make identification from video.

Methods: Cars were detected using the You Only Look Once (YOLO) v4 detector. The YOLO output was then used for classification in the next step. Colors were classified using the k-Nearest Neighbors (kNN) algorithm, whereas car models and makes were identified with a single-shot detector (SSD). Finally, license plates were detected using the OpenCV library and Tesseract-based optical character recognition. For the sake of simplicity and speed, the subsystems were run on an embedded Raspberry Pi computer.
Results: A camera was mounted on the inside of the windshield to monitor cars in front of the camera. The system processed the camera's video feed and provided information on the color, license plate, make, and model of the observed car. Knowing the license plate number provides access to details about the car owner and roadworthiness, shows whether the car or license plate has been reported missing, and reveals whether the license plate matches the car. Car details were saved to file and displayed on the screen. The system was tested on real-time images and videos. The accuracies of car detection and car model classification (using 8 classes) in images were 88.5% and 78.5%, respectively. The accuracies of color detection and full license plate recognition were 71.5% and 51.5%, respectively. The system operated at 1 frame per second (1 fps).
Conclusion: These results show that running standard machine learning algorithms on low-cost hardware may enable the automatic detection and classification of cars in traffic. However, there is significant room for improvement, primarily in license plate recognition. Accordingly, potential improvements in the future development of the system are proposed.

Introduction
As a product of technological advancements and developments within the automotive industry, cars are now equipped with cutting-edge safety systems. Traffic accident reduction has been a driving force behind new safety systems as human error accounts for 94% of all auto accidents [1]. In 2016, within the European Union, traffic accidents accounted for 25,600 deaths and 1.4 million injuries [2]. Worldwide, between 1 and 1.24 million people die on the roads every year and 20-50 million people are injured [3]. This has triggered initiatives such as VisionZero [4], which aim to eliminate all traffic fatalities and severe injuries using various modalities (including active and passive car safety features).
Car safety systems are commonly referred to as Advanced Driver Assistance Systems (ADAS) and may feature different autonomy levels [5], from 0 (no automation) to 5 (full automation/autonomy). The parking assistant is one such advanced system projected for serial installation into all commercial vehicles by 2025, based on the National Highway Traffic Safety Administration (NHTSA, USA) recommendations [6]. There are also plans to introduce systems such as traffic lane monitoring and pedestrian and traffic sign detection. ADAS has achieved significant reductions in car crash rates, as much as 78% for reversing collisions [7]. Driving Monitoring and Assistance Systems (DMAS) are somewhat similar, but unlike ADAS, these focus on monitoring driver behavior [3]. Although this paper primarily focuses on technical features, user acceptance needs to be considered side by side when introducing any new technology [8,9]. This is especially important when it comes to artificial intelligence (AI)-based systems [10].
The limitations of a car's onboard computing power need to be taken into consideration when implementing any real-time driving assistance system. More specifically, computing resources may be expanded, but this takes up more space. This is especially true of systems that source data from images and use machine learning for data processing [11]. Loce et al. [12] presented a good overview of widely available, computer vision-based ADAS/DMAS technologies, such as lane departure warning, pedestrian detection, and driver monitoring systems. The authors also note that these systems can be applied in road safety and traffic monitoring (which is the application of the system proposed in this paper). A more recent overview of computer vision-based ADAS technologies with a slightly different categorization is presented in Horgan et al. [13].
Car safety features (regardless of type) are reliable only as far as they are regularly inspected and serviced. Consequently, unregistered cars (and unroadworthy cars) present a particular traffic safety concern. According to an observational study conducted in Queensland, Australia, around 3% of cars on the roads were unregistered [14]. A significant correlation between unregistered vehicles and a high risk of causing accidents has also been observed. This has been explained by such vehicles not being roadworthy and the risky behavior of drivers who usually did not hold a valid driver's license. Comparable numbers of unregistered vehicles have also been observed on the roads in other parts of the world; for example, 3.38% in California, USA [15]. In Croatia, traffic police identified 17,832 unregistered (not roadworthy) vehicles during regular patrols in 2020, which was a 13.8% increase compared to 2019 [16].

st-open.unist.hr
Traditionally, unregistered cars have been detected using specialized Automatic Number Plate Recognition (ANPR) cameras installed along roads. Dark [17] used this method to collect data from 256 ANPR cameras installed across the UK, which were made publicly available for further analysis. The obvious disadvantage of these systems is that they are static and therefore easier to avoid for unconscientious drivers. Additional challenges for license plate recognition systems include 1) varying lighting conditions, including glare, shadows, and blur, 2) high-speed vehicles, and 3) various obstacles [18].
However, recent developments in electronics and optical systems have made these cameras more mobile and they can now be mounted within police vehicles. For example, the Croatian police use the Nero-ANPR system [19], produced in Croatia [20]. The system, launched in 2016 as a pilot project, costs around ten thousand euros and consists of a camera that is mounted on the roof of a police vehicle. The system can scan up to 10 vehicles per second (in day and night conditions) with 90% accuracy, even at speeds over 200 km/h [21]. Despite its impressive nominal features, the disadvantages of the system, including its high cost and rooftop installation (a more complicated installation setup), may impede its wider application. The cost of these systems is considerable in other countries as well [22]: approximately 6 million AUD for about 450 units. Police in other countries use similar systems, mostly to look for stolen cars [23]. Intriguing initiatives aiming to involve citizens in the detection and analysis of car license plates through interactive games have also been proposed [24].
The above discussion clearly demonstrates the rise in the usage of cameras and related computer vision algorithms in cars and car safety [25,26]. The groundbreaking development of neural networks during the last decade has also contributed to their application in image processing [27], where one of their most important functions is the detection of specific objects in images [28]. Object detection in images is a key technology behind advanced driver assistance systems. For example, these systems allow cars to detect lanes and other road markings [29] and thus prevent illegal lane cutting while driving. Additionally, they detect and track pedestrians in traffic [30] to improve road safety. Object detection is also useful in video surveillance, autonomous driving [31], robotic vision [32], and similar applications.
Deep neural networks, recurrent neural networks, and convolutional neural networks are just some of the many types of neural networks with practical applications in various technological and scientific fields [27]. Convolutional neural networks (CNNs) are the backbone of modern object detection methods [28]. The name comes from a linear mathematical operation between matrices called convolution. CNNs consist of convolutional layers, nonlinear layers, pooling layers, and fully connected layers. The convolutional layer is a CNN's main layer and its task is to extract features from images, such as edges and colors. When combined with other layers, it can also detect more complex features, such as various shapes, digits, or parts of the face. Like the convolutional layer, the pooling layer reduces the spatial dimension of a convolved image. This, in turn, cuts down on the computing power required to process data because it downsizes the matrix dimension of an image [33]. Put together, a convolutional layer and a pooling layer make up the i-th layer of a convolutional neural network. Depending on the complexity of an image, the number of these layers may be increased to reveal more details, but this will require more computing power. Fully connected layers form the last few layers in the network. The input to the fully connected layer is a flattened vector with features selected by the convolutional layers. The output is normally connected to the softmax function, which normalizes the output and produces a vector with the probabilities of every possible outcome of a classification problem.
This paper aimed to design a deep convolutional network-based system that can identify license plates, colors, makes, and models of cars in traffic from available video feeds (in real or near-real time). The camera should be mounted on the inside of the windshield of a car in traffic to detect the cars in front of it.
The system is intended to run on an embedded Raspberry Pi computer and car color, license plate, make, and model information should be displayed on the computer screen. Special attention was given to convolutional neural network-aided detection, which was the core element of the system. Seeing as the system comprises various subsystems, the paper also provides the rationale for selecting specific algorithms and describes the process of collecting required images and files as well as neural network learning. The resulting system achieved acceptable results for the target application and is more cost-effective and easier to set up/install than the systems that are currently in use.
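The pooling and softmax steps described earlier in this section can be illustrated with a minimal sketch. It is not part of the system presented in this paper: the 2×2 max pooling window, the toy feature map, and the raw class scores are assumptions chosen only to show how pooling shrinks a matrix and how softmax turns scores into probabilities.

```python
import math


def max_pool_2x2(feature_map):
    """Downsample a 2D list by keeping the maximum of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4x4 feature map (assumed values); pooling reduces it to 2x2.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 4],
]
pooled = max_pool_2x2(feature_map)

# Assumed raw outputs of a fully connected layer for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
```

The pooled map has a quarter of the original entries, which is the source of the computational savings mentioned above, and the class with the largest raw score also receives the largest probability.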

Materials and methods
The flowchart of the proposed solution for the automatic detection and classification of cars in traffic is shown in Figure 1. As shown in the figure, the system consists of three subsystems. Car detection in videos/images is done by the You Only Look Once (YOLO) v4 neural network architecture. The input for all subsystems is the CNN output in the form of part(s) of images (called "patches") where a car in traffic was detected. The first subsystem detects the license plate position in the input car image and then forwards that patch (containing only the license plate) to a character recognition algorithm, which then reads the license plate. The second subsystem detects car makes and models (for which it was trained) based on additional CNN architectures, whereas the third subsystem detects car color using computer vision techniques. The rest of this section briefly describes the algorithms used in this study as well as the data collection for network learning and system testing.
In this paper, the term "car make" refers to a car manufacturer (e.g., Audi, Alfa Romeo, Renault), while "car model" refers to a product line designation used by the manufacturer (e.g., Q1, 159, Scenic). Both parameters (make and model) were used to avoid confusion in future system extensions in cases where two different manufacturers use the same car name (e.g., Oldsmobile Fiesta and Ford Fiesta or Opel Diplomat and Dodge Diplomat). Of course, car models may be further subdivided based on the year of production and variant, but as this paper does not delve into such detailed classifications, this remains a possible future expansion of the system.

Detecting cars
Cars were detected with a YOLOv4-tiny detector [34]. YOLOv4-tiny is based on the YOLOv4 architecture but features a streamlined network structure with fewer parameters and filters. This makes it suitable for use on mobile and embedded devices [34] and thus an appropriate choice for the Raspberry Pi computer. In general, CNN-based detectors may be divided into two-shot detectors (e.g., various versions of Region-based Convolutional Neural Networks, R-CNN [35], and Region-based Fully Convolutional Networks, R-FCN [36]) and single-shot detectors (e.g., YOLO and SSD). All two-shot detection models work by proposing regions, so the detection process has two stages. The model first proposes a set of candidate regions using selective search or a region proposal network, and the classifier then processes only these region candidates (which makes the approach computationally expensive). A different approach is to skip the region proposal stage and run detection directly on the whole image. These (single-shot) detectors are much faster and simpler, but at the cost of somewhat poorer performance.
Being a single-shot detector, YOLO skips the region proposal step and takes a single "look" at the image to detect objects, which speeds up the process compared to two-shot architectures [34,37,38]. YOLO divides the input image into an S×S grid. If the center of an object falls into a grid cell, that cell is then responsible for detecting the object. In other words, the object is assigned to the cell containing the center of the object.
The classification and localization network can only detect one object at a time, which means that the network can detect only one object per cell. This can cause several issues: 1) as the total number of grid cells is S², the maximum number of objects that the model can detect is also S²; 2) if a grid cell contains more than one object, the model will not be able to distinguish between them; and 3) if an object spans more than one cell, it may be detected more than once [34].
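The grid assignment rule described above can be sketched in a few lines. This is an illustration only, not the detector's actual code: the function name, the grid size of 7, the 448×448 image size, and the bounding boxes are all assumed values.

```python
def responsible_cell(box, image_width, image_height, s):
    """Return (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_center = (box[0] + box[2]) / 2
    y_center = (box[1] + box[3]) / 2
    # Clamp so that a center lying exactly on the far border stays in the grid.
    col = min(int(x_center / image_width * s), s - 1)
    row = min(int(y_center / image_height * s), s - 1)
    return row, col


S = 7  # grid size; an illustrative value, not taken from this paper
boxes = {
    "car_a": (50, 200, 150, 300),
    "car_b": (60, 210, 170, 290),   # center falls into the same cell as car_a
    "car_c": (300, 100, 420, 220),
}
cells = {name: responsible_cell(b, 448, 448, S) for name, b in boxes.items()}
max_objects = S * S  # upper bound when each cell can report one object
```

In this toy case, `car_a` and `car_b` are assigned to the same cell, which is exactly the situation in which a one-object-per-cell detector cannot distinguish between them.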
The first-generation YOLO was trained to detect 20 different classes, such as a person, cat, dog, car, etc. [34]. YOLO uses a single convolutional network to simultaneously predict multiple bounding boxes and the class probabilities for these boxes. The network was inspired by the GoogLeNet image classification model. It has 24 convolutional layers followed by 2 fully connected layers [34].
The YOLOv2 architecture is mainly focused on improving the recall and localization of the YOLO network while maintaining classification accuracy. The following options enable better performance:
• Batch normalization: improves mean Average Precision (mAP) by more than 2%.
• High-resolution classification: the network is trained using 224×224 images, but the last 10 epochs use 448×448 images. This ensures better network performance for high-resolution image inputs and a 4% mAP increase.
YOLOv2 identifies the best anchor box shapes to facilitate network learning and predicts bounding boxes relative to these anchors, instead of calculating bounding box coordinates directly with fully connected layers as in the first-generation YOLO. The creators of YOLOv2 recommend using Darknet-19, a new classification model with 19 convolutional layers and 5 pooling layers. It has 91.2% top-5 accuracy on the ImageNet database, an improvement over the previous version, which had 88% accuracy [39]. To make YOLOv2 robust for detection in different-sized images, the model was trained using various input image sizes. This allows the network to predict objects at different resolutions, providing an easy compromise between speed and accuracy.
During its heyday, YOLOv2 was the fastest detector. Although detectors developed in later years achieve superior accuracy, YOLOv2 remains one of the fastest. YOLOv3, in turn, traded some of this speed for improved accuracy [40]. A major change with this version was the new architecture consisting of 53 convolutional layers, significantly more than YOLOv2. The salient feature of the new version is detection at three different scales.
Detection at different layers helps address the issue of detecting small objects, which was a major drawback of YOLOv2. The first detection layer is responsible for detecting large objects, whereas the second and third layers detect medium and small objects, respectively.
YOLOv4, the most recent version of the detector (at the time of designing the proposed system), strikes a balance between detection speed and accuracy. As detectors are trained off-line, this advantage is harnessed to develop better training methods and produce detectors with greater accuracy at the same inference cost. Methods that change only the training strategy are known as the "bag of freebies" [41]. Data augmentation is one such method. Its purpose is to increase the variability of input images and make the object detector more robust to images obtained from various environments. Photometric and geometric distortions are the two most frequently used data augmentation methods. Photometric distortion adjusts image brightness, contrast, hue, saturation, and noise, whereas geometric distortion adds random image scaling, cropping, and rotation.
In terms of architecture, the objective is to strike the optimal balance between the input network resolution, number of convolutional layers, number of parameters, and number of output layers. A reference model optimized for classification may not be optimized for detection. Therefore, Soviany and Ionescu [38] conducted extensive research on this topic using three potential backbones of the YOLOv4 architecture: CSPResNext50, CSPDarknet53, and EfficientNet-B3. The CSPDarknet53 neural network used in this paper has been experimentally proven to be the optimal backbone model for the detector. An additional advantage of this basic architecture was its support in the OpenCV library, which shortened the system implementation time. CSPDarknet53 was overlaid with a Spatial Pyramid Pooling (SPP) layer [42] as this significantly increases the receptive field, isolates the most salient contextual features, and causes almost no network speed reduction. Another key decision when designing the model was using the Path Aggregation Network (PANet) to aggregate parameters from different backbone layers for different detection levels [43].
YOLOv4, therefore, consists of a CSPDarknet53 backbone, an additional SPP module, PANet-based information flow, and YOLOv3 as the detector.
Since this system is designed to run on a Raspberry Pi computer, the most appropriate YOLO implementation had to be selected. The decision is all the more important because the finalized system is supposed to run plate and car model detection, plate classification, and character recognition on the detected plate in parallel. The implementation used in this paper was OpenCV-dnn as this was the fastest YOLOv4 implementation for CPU. This has been shown in several tests [44,45] on different CPU configurations as well as for various deep architectures (e.g., VGG-16 and DenseNet121).
The algorithm used to detect cars in an image, draw bounding boxes, and extract car images for a more in-depth car classification is shown in Appendix 1.
After successful detection, the detected car is cut out from the image and sent on for further processing and classification. As the cropped image is smaller than the input image, there is less data to process, which speeds up the process and makes it more computationally efficient.
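The cropping step can be sketched as follows. This is not the code from Appendix 1: the function name, the (x, y, width, height) box layout, and the clamping to the image borders are assumptions made for illustration, and a plain nested list stands in for a camera frame.

```python
def crop_patch(image, box):
    """Cut the region box = (x, y, width, height) out of a row-major image.

    The box is clamped to the image borders so that a detection that
    slightly overshoots the frame still yields a valid patch.
    """
    height, width = len(image), len(image[0])
    x, y, w, h = box
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in image[y0:y1]]


# A tiny 6x8 "image" whose pixel values encode their own row and column.
frame = [[10 * r + c for c in range(8)] for r in range(6)]

# Assumed detector output; the box overshoots the right border of the frame.
patch = crop_patch(frame, (5, 2, 6, 3))
```

Here the patch holds 9 values instead of the 48 in the full frame, so every later step (color, model, and plate processing) works on a fraction of the original data.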

Detecting car make and model
Single Shot Detector (SSD) [46] is another convolutional neural network-based detector that was used in this study. Eliminating bounding box proposals and the subsequent pixel or feature resampling stage results in a fundamental improvement in speed. Further improvements include using small convolutional filters to predict object categories and bounding box offsets, with separate filters for different aspect ratios. These modifications provide a high level of accuracy even with relatively low-resolution input, which further increases detection speed. Although these may seem like minor contributions taken separately, SSD testing has yielded very favorable results [46]. SSD is based on a feed-forward convolutional network. It generates a fixed-size collection of bounding boxes, scores the presence of an object class in each box, and then applies non-maximum suppression to produce the final detections. Early network layers are based on a standard image classification architecture. This is followed by an auxiliary detection structure with the following salient features:
• Multi-scale feature maps: feature layers of decreasing size are appended to the end of the base network to allow detection at multiple scales.
• Convolutional predictors: every appended feature layer can produce a fixed set of detection predictions using a set of convolutional filters.
• Default boxes: a set of default bounding boxes is associated with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. For each feature map cell, the network predicts offsets relative to the default box shapes in the cell, as well as per-class scores. These, in turn, indicate the presence of a class instance in each of these boxes [46].
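The non-maximum suppression step that SSD applies to its scored boxes can be sketched as below. This is a generic greedy variant written for illustration, not the TensorFlow implementation used in this study; the 0.5 overlap threshold and the example detections are assumed values.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    The highest-scoring box is kept, every remaining box that overlaps it
    by at least iou_threshold is discarded, and the process repeats.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


detections = [
    ((100, 100, 200, 200), 0.90),
    ((105, 105, 205, 205), 0.75),  # near-duplicate of the first box
    ((300, 120, 380, 200), 0.60),  # a separate object
]
final = non_maximum_suppression(detections)
```

The near-duplicate box is suppressed because it overlaps the stronger detection, while the separate object survives, so one box per object remains.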
During learning, it must be established which default boxes correspond to a ground truth detection, and the network is trained accordingly. Each ground truth box is first matched with the default box with the best Jaccard coefficient (intersection over union). Default boxes are then also matched to any ground truth box if their Jaccard coefficient is greater than 0.5. This simplifies the learning process and allows the network to predict high scores for several overlapping default boxes instead of requiring it to choose only the box with the greatest overlap [47].
An important learning parameter is the selection of default box scales and aspect ratios. To handle objects of different scales, some methods suggest processing an image at different sizes and then combining the results. However, using feature maps from several different layers can achieve the same effect while sharing parameters across all object scales. Using feature maps from the earlier layers may also improve semantic segmentation quality, as these layers capture more fine detail.
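The matching rule based on the Jaccard coefficient can be sketched as follows. The function names and example boxes are hypothetical and the sketch ignores details of a real training pipeline (such as encoding offsets); it only shows how a threshold of 0.5 lets several default boxes match one ground truth box.

```python
def jaccard(a, b):
    """Jaccard coefficient (IoU) of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, ground_truth_boxes, threshold=0.5):
    """Return {default_box_index: ground_truth_index} for positive matches."""
    matches = {}
    if not default_boxes or not ground_truth_boxes:
        return matches
    # Step 1: every ground truth box claims its best-overlapping default box.
    for g, truth in enumerate(ground_truth_boxes):
        best = max(
            range(len(default_boxes)),
            key=lambda d: jaccard(default_boxes[d], truth),
        )
        matches[best] = g
    # Step 2: any other default box above the threshold is matched as well.
    for d, box in enumerate(default_boxes):
        if d in matches:
            continue
        overlaps = [jaccard(box, truth) for truth in ground_truth_boxes]
        g = max(range(len(overlaps)), key=overlaps.__getitem__)
        if overlaps[g] > threshold:
            matches[d] = g
    return matches


default_boxes = [
    (0, 0, 100, 100),      # almost identical to the ground truth box
    (10, 0, 110, 100),     # shifted, but still overlapping strongly
    (200, 200, 300, 300),  # far away
    (40, 40, 60, 60),      # inside the object, but far too small
]
ground_truth = [(2, 0, 102, 100)]
matches = match_default_boxes(default_boxes, ground_truth)
```

Both of the first two default boxes become positive training examples for the same car, whereas the distant box and the very small box are treated as background.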
The SSD was selected for the car model classification task precisely due to its speed. We used the implementation from TensorFlow, a Python library created by Google. It should be noted that SSD algorithms implicitly contain features for image classification, although their primary use is object detection in images. In cases where this subsystem receives a reliable patch containing a car from the car detection step, a simple network, such as MobileNet v2, would be sufficient for classification. However, we believe our method helped avoid potentially problematic real traffic situations where the previous step may deliver a box containing more than one car (because cars overlap or there is another car in the target car's rectangular box). It therefore laid a foundation for updated versions of the system with options to track and classify more than one car per image (video), making the system more modular.
TensorFlow can train and run deep neural networks for digit classification, object detection in images, text creation, sequence models for machine translation, natural language processing, and simulations based on partial differential equations [48].
The procedure is the same as for training the YOLOv4-tiny network. Detectable car models should be defined before the image collection stage. Given the vast number of different car makes and models on roads, the network was trained using only a small subset of (most popular) vehicles: VW Golf 7, Renault Clio, Suzuki Vitara, Peugeot 208, Renault Scenic, VW Polo, Citroen C4, BMW 3. The "Others" class was added to classify cars that did not make the list and address the challenge posed by a large number of car models on roads.
Networks can be trained from scratch but there are also pre-trained models available.
Transfer learning usually refers to taking a model trained for solving one problem and applying it to a different problem. In deep learning, the transfer learning technique refers to using a problem that is similar to the problem we are trying to solve to train a neural network. [...] allow for a further file size reduction and execution speed increase [49].
The "Model" function detailed in Appendix 1 receives a cut-out car image, which is then sent to the SSD to classify the car model and make. The function returns the name of the detected car model.

License plate detection and recognition
Automatic Number Plate Recognition (or ANPR) is a system designed to detect and recognize vehicle license plates without human intervention. It includes the following steps: 1) license plate detection in input images and 2) application of the optical character recognition to detected plates [18].
ANPR is a challenging branch of computer vision due to the wide variety of license plates in different countries around the world. In this study, car license plates were detected using the OpenCV library. The first step in detection was resizing images to a common resolution to avoid issues due to differently sized images. Images were then converted from RGB to grayscale. As the pixel color value does not have an effect on the location of the plate in the image, this change sped up the subsequent image processing operations. All other content in the image except the plate is useless information that may cause interference and therefore has to be eliminated. This was achieved by applying a bilateral filter to blur unwanted details in the image. By changing the parameter values for this filter, the image was blurred with different blur intensities to eliminate interference. The parameter values needed to be carefully adjusted, as values that are too large may negatively affect the useful part of the image, i.e., the license plate.
The plate's edges could be detected after eliminating interference. This was done by defining the minimum and maximum values of the intensity gradient to show only edges with intensity gradients within the two values. The detected contours were then sorted by size in descending order and only the top 30 were kept. These 30 contours were assumed to contain the edge contours of a license plate. Since license plates are rectangular, the system searched among them for a contour with four corners, i.e., one that circumscribes a rectangle. After identifying the plate's bounding contour, the remaining operations were performed only on that image patch and the rest of the image was eliminated. Croatian license plates feature the national coat of arms of the Republic of Croatia, which may cause recognition errors. To avoid this, a bilateral filter and a threshold filter were applied to plates to reduce the effect of the coat of arms.
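The contour selection logic (sort by size, keep the top 30, pick a four-cornered shape) can be sketched without any image library. The sketch assumes that each contour has already been simplified to its corner points, for example by polygon approximation after edge detection in OpenCV; the function names and the example contours are hypothetical and this is not the code from Appendix 1.

```python
def polygon_area(points):
    """Area of a closed polygon given as a list of (x, y) corners (shoelace formula)."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2


def find_plate_candidate(contours, keep=30):
    """Return the largest four-cornered contour among the `keep` largest ones."""
    largest = sorted(contours, key=polygon_area, reverse=True)[:keep]
    for contour in largest:
        if len(contour) == 4:  # four corners: a plausible rectangle
            return contour
    return None  # no plate-like contour found in this image


# Assumed, already simplified contours from one car patch.
contours = [
    [(0, 0), (400, 0), (420, 150), (300, 300), (0, 280)],  # car outline, 5 corners
    [(120, 200), (280, 200), (280, 240), (120, 240)],      # plate-like rectangle
    [(10, 10), (20, 10), (20, 15), (10, 15)],              # small rectangle
]
plate = find_plate_candidate(contours)
```

Because the contours are visited from largest to smallest, the plate-sized rectangle is chosen over the small one. The `None` return value corresponds to the failure case in which no suitable contour is detected.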
Optical Character Recognition (OCR) is a technology for extracting text from images, scanned documents, and photographs. It converts images that contain text into machine-readable textual data. Tesseract [50] is an OCR engine that supports more than a hundred languages and can learn new languages. In this paper, we used Tesseract to recognize characters on the studied license plates because it provides support for the Croatian language. Tesseract can segment images into lines of text or words. A license plate can then be viewed as a single line of text or as a single word. After testing these two segmentation options, it was concluded that treating the plate as a single word provided slightly better results. Detectable characters were limited to Croatian capital letters and numbers as these are the only types of characters that appear on license plates [50].
Before Tesseract could recognize the characters on a license plate, the plate first had to be detected. This was a challenging task because license plates take up a minor portion of images. The remainder of the image further complicates matters because it interferes with detecting plate contours. Due to their small size, characters on detected plates are hard to recognize. Images are also often blurry because the cars were in motion. This method may therefore detect wrong contours or altogether fail to detect any contours. The structure of the function for detecting and recognizing characters on a plate is shown in Appendix 1.
Finally, we note that, to preserve computational efficiency, no additional YOLO (or any other) neural networks were used in license plate detection or recognition. More specifically, preliminary tests had shown that such networks tended to slow down the system, whereas the utilized computer vision methods did not have such limitations. In addition, the training dataset (images) would need to be significantly bigger, which would in turn make the extension to detecting and recognizing foreign license plates more challenging (an additional dataset would be required).
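The single-word segmentation and the character restriction can be expressed through Tesseract options, as sketched below. The exact character set and the clean-up function are assumptions made for illustration. Page segmentation mode 8 (single word), mode 7 (single text line), and the `tessedit_char_whitelist` variable are standard Tesseract options, but the settings actually used in this study are not stated in the text.

```python
# Assumed character set: digits plus capital letters, including Croatian diacritics.
PLATE_CHARACTERS = "ABCČĆDĐEFGHIJKLMNOPQRSŠTUVWXYZŽ0123456789"

# Tesseract options: page segmentation mode 8 treats the image as a single
# word (mode 7 would treat it as a single line of text).
TESSERACT_CONFIG = f"--psm 8 -c tessedit_char_whitelist={PLATE_CHARACTERS}"


def clean_plate_text(raw_text):
    """Uppercase OCR output and drop every character outside the whitelist."""
    return "".join(ch for ch in raw_text.upper() if ch in PLATE_CHARACTERS)


# Hypothetical raw OCR outputs with separators, whitespace, and stray symbols.
cleaned = clean_plate_text(" st-1234-ab\n")
cleaned_diacritics = clean_plate_text("Ži 123|a")
```

With a Python wrapper such as pytesseract, a string like `TESSERACT_CONFIG` would typically be passed through the `config` argument together with the Croatian language code `hrv`. The clean-up function is a second line of defense for separators and whitespace that OCR output often contains.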

Identifying car color
Colors were classified using the K-Nearest Neighbors algorithm trained by Red-Green-Blue

Data collection
Creating a class detection system requires the following files: 1) an image set with coordinates and specific class annotations, 2) a configuration file matching the number of classes, 3) files with class names and absolute paths to specific directories, and 4) learning and testing text files with image names.
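As an illustration of how these files depend on the number of classes, the sketch below assumes the conventions of the Darknet framework, which is commonly used to train YOLOv4-tiny; the text does not state which framework was used. In Darknet, the convolutional layer before each YOLO layer needs (classes + 5) × 3 filters, and a small data file points to the class names and image lists. All paths and function names here are hypothetical.

```python
def yolo_layer_filters(num_classes, anchors_per_scale=3):
    """Filters in the convolutional layer that precedes each YOLO layer.

    Each anchor predicts 4 box coordinates, 1 objectness score, and one
    score per class.
    """
    return (num_classes + 5) * anchors_per_scale


def data_file_text(num_classes, train_list, test_list, names_file, backup_dir):
    """Build the contents of a Darknet-style data file."""
    lines = [
        f"classes = {num_classes}",
        f"train = {train_list}",
        f"valid = {test_list}",
        f"names = {names_file}",
        f"backup = {backup_dir}",
    ]
    return "\n".join(lines) + "\n"


class_names = ["Car"]  # a single class for the car detector
filters = yolo_layer_filters(len(class_names))
names_text = "\n".join(class_names) + "\n"  # one class name per line
data_text = data_file_text(
    len(class_names),
    "/data/train.txt",   # hypothetical absolute paths
    "/data/test.txt",
    "/data/obj.names",
    "/data/backup",
)
```

For a single "Car" class, the configuration file would therefore need 18 filters in each layer preceding a YOLO layer, which is what "a configuration file matching the number of classes" amounts to under these assumptions.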
One of the essential tasks when training a neural network is collecting and labeling data (images), since CNNs require large amounts of data for training. Annotated car images were collected online, from the Open Images Dataset. The Open Images Dataset comprises 9 million annotated images with object bounding boxes, object segmentation masks, visual relationships, and localized narratives. Considering object detection alone, it contains 2 million images with 16 million bounding boxes for more than 600 classes [52]. Examples of images from the database are shown in Figure 3. Bounding boxes from the database were mostly hand-drawn by experts to ensure accuracy and consistency. The images were diverse, including different-sized images and images containing complex scenes and multiple objects. The downloaded database was expanded with images downloaded from various motor vehicle classifieds, and a portion of images was collected independently, on roads. Since the diversity of images was essential, the image database was created using as many different sources as possible. The goal was to have the observed class in different environments, with a varying number of surrounding objects, in conditions of good and poor visibility. In the collected image database, the "Car" class is shown on the highway, downtown, in parking lots, during the day, and in the dark.
The total number of collected images was 2,100.
In photographs and images sourced online, observed objects were annotated manually.
Several programming tools enable loading images, annotating relevant classes for neural network training, and storing relevant data as notes in the appropriate format. LabelImg is a graphical image annotation tool. It is written in Python and uses Qt for its graphical interface. Annotations are saved in Pascal VOC and MS COCO format [53]. The user interface is shown in Figure 4.

st-open.unist.hr 13
Collecting images to train the car make and model classification system was somewhat more complicated than for car detector training because it required images of specific car models and makes. Most of the images were sourced online or from car classifieds. This approach assumed that the camera was installed on the windshield of a moving car and that the detected cars were moving in front of that car. This translated into looking for rear-view car images. The number of images collected per car model was 200 (2,000 in total, as we used 10 classes), separated into 190 training images and 10 images for network testing (for a total of 1,900 training images and 100 testing images). The images were distributed in a 95:5 ratio because finding images of specific car models required a considerable amount of time, so a greater focus was placed on training. We note that a validation dataset was not employed during distribution, given that its primary purpose is to tune neural network hyperparameters to increase accuracy [54]. The focus of this phase of system development was on proving the concept, so the step was omitted. Assigning images to a validation set would unnecessarily reduce the (already limited) set of training images and thus potentially reduce the accuracy of the results. After distribution, the cars were annotated with the LabelImg tool. This was followed by the creation of a feature map containing the names of all car models.
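The per-class split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the file names, the class names, the random shuffling, and the helper `split_per_class` are all assumptions made for the example.

```python
import random

def split_per_class(images_by_class, n_test=10, seed=0):
    """Hold out n_test images per class for testing; the rest go to training."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, images in sorted(images_by_class.items()):
        shuffled = list(images)
        rng.shuffle(shuffled)
        test += [(name, label) for name in shuffled[:n_test]]
        train += [(name, label) for name in shuffled[n_test:]]
    return train, test

# Hypothetical file names: 10 classes with 200 images each.
images_by_class = {
    f"model_{c}": [f"model_{c}_{i:03d}.jpg" for i in range(200)]
    for c in range(10)
}
train_set, test_set = split_per_class(images_by_class)
print(len(train_set), len(test_set))
```

With 200 images per class and 10 held out, every class contributes equally to both sets, which keeps the small test set balanced across car models.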
To classify car colors, the first step involved collecting images of cars of different colors and sorting them by color into directories. Next, a file with the RGB value of each image was created. Car colors were identified based on these values by finding the nearest known neighbor in the color space using the kNN algorithm. To ensure better color classification results, the kNN algorithm was only applied to patches that contained a car. Seeing as the car was detected by the YOLOv4-tiny convolutional neural network, the kNN algorithm was applied only to the neural network output (Figure 1). For even better results, patches containing a car were split in half horizontally to eliminate the rear windshield. This was done because testing had shown that reflections from rear windshields and tinted rear windshields may have a significant effect on the algorithm output. Table 1 shows a summary of the number of images that were used to run and/or develop parts of the system. We note that images from one subsystem were not used in other subsystems (except for license plate detection and recognition systems). This means that the total image set for all subsystems contained 4,220 different images. The (image) dataset used did not contain any negative examples or images without a car.
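The color classification step can be illustrated with a short sketch. It assumes that each car patch is summarized by the mean RGB value of its lower half (the half without the rear windshield), that distances are Euclidean, and that k = 3; the training values and the helper names are hypothetical and are not taken from the study.

```python
from collections import Counter

def mean_rgb_lower_half(patch):
    """Average RGB over the lower half of a patch (list of rows of (r, g, b) tuples)."""
    lower = patch[len(patch) // 2:]  # drop the upper half, where the rear windshield is
    pixels = [px for row in lower for px in row]
    n = len(pixels)
    return tuple(sum(px[c] for px in pixels) / n for c in range(3))

def knn_color(sample, training, k=3):
    """Return the majority label among the k training points nearest to sample."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training, key=lambda item: dist2(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (mean RGB, color label) pairs.
training = [
    ((255, 0, 0), "red"), ((200, 30, 30), "red"), ((180, 10, 20), "red"),
    ((0, 0, 0), "black"), ((20, 20, 20), "black"), ((40, 40, 40), "black"),
    ((255, 255, 255), "white"), ((230, 230, 230), "white"), ((240, 240, 240), "white"),
]

# Toy 4x2 patch: dark windshield rows on top, red bodywork below.
patch = [[(10, 10, 10)] * 2, [(10, 10, 10)] * 2,
         [(200, 20, 20)] * 2, [(200, 20, 20)] * 2]
print(knn_color(mean_rgb_lower_half(patch), training))
```

If the whole patch were averaged instead, the dark windshield rows would pull the mean toward black, which is the effect the half-patch split is meant to avoid.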

Training neural networks
Neural networks can be trained on a local computer or in the cloud using Google Colaboratory. If using a local computer, the training time largely depends on its processor. Additionally, GPU-accelerated training is supported only on some NVIDIA graphics cards. Using Google Colaboratory, therefore, makes training easier and faster.
Colaboratory, or "Colab" for short, is a product from Google Research that allows writing and running Python code through a web browser. Colab is especially well suited to machine learning, data analysis, and education. The product does not require any programming environment setup and provides free access to computer resources, especially the Tesla K80 graphics card. The major advantages of Colab are thus its ease of use and free GPU access [55].
Before initiating neural network training, the configuration file needs to be customized to accommodate classes and various learning parameters. Table 2 shows parameter values set in the configuration file. The "Batch" parameter defines the number of images sent to the network in batch to enable network learning. In addition, Table 2 shows other hyperparameters introduced to ensure the repeatability of the experiment. Note that most of these were default parameters from the pre-trained network and no algorithms were used to optimize them. Consequently, better hyperparameters that would ensure greater system performance and accuracy may be available, but their effect on execution speed is unclear.
Due to the simplicity of the YOLOv4-tiny architecture, the "Batch Size" parameter was set to 64. As the height and width of images had to be multiples of 32, both were set to 416 pixels. This size was a compromise between detection performance and training time. For better results, height and width parameters would need to be increased, but this would also increase training time. The maximum number of batches is calculated using formula (1):
$$\text{max\_batches} = 2000 \times \text{classes} \qquad (1)$$
This parameter cannot be set to less than 6,000, so all detectors with fewer than 4 classes set this parameter to 6,000. The value for a 5-class detector was 10,000. The next parameter was steps, which had two values set to 80% and 90% of the maximum batch value, respectively. These values were 4,800 and 5,400, respectively, when training a single class. The number of classes had to be set to the desired value, in this case 1 because the network was trained to detect cars only. The number of filters was set based on formula (2):
$$\text{filters} = (\text{classes} + 5) \times 3 \qquad (2)$$
Two files had to be created to train the YOLOv4-tiny detector.
(Table 3)
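As a worked example of these settings, the sketch below computes the class-dependent values for a given number of classes. It assumes the common Darknet conventions (2,000 batches per class with a floor of 6,000, steps at 80% and 90% of the maximum, and (classes + 5) × 3 filters in the layer before each detection layer); the function name and the returned dictionary keys are illustrative only.

```python
def yolo_training_params(num_classes):
    """Class-dependent Darknet settings, following the conventions assumed above."""
    max_batches = max(6000, 2000 * num_classes)  # never below 6,000
    # Integer arithmetic avoids floating-point rounding in the 80% / 90% steps.
    steps = (max_batches * 8 // 10, max_batches * 9 // 10)
    filters = (num_classes + 5) * 3
    return {"max_batches": max_batches, "steps": steps, "filters": filters}

print(yolo_training_params(1))  # single "Car" class
print(yolo_training_params(5))  # 5-class detector
```

For the single-class car detector this gives 6,000 batches with steps at 4,800 and 5,400, matching the values quoted in the text, and 10,000 batches for a 5-class detector.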

After merging the individual subsystems into a single system, the programming portion of the work was over and the only thing left was to test the system on the Raspberry Pi. After installing all required libraries, the automatic traffic detection and classification system was ready for in-car mounting and video testing.
The automatic traffic detection and classification system was mounted within a car as shown in Figure 5. The camera was mounted on the rearview mirror. System speed is a key factor in video testing, and this system ran at 1 frame per second (1 fps).
(Figure 6)
The probability threshold is a limit value used to decide whether the required object was, in fact, detected: if the detection probability is lower than the threshold, the detected object is assumed not to be the required object, so the bounding box is not drawn. The threshold value for car detection in this paper was set to 0.8. This value was determined experimentally and was somewhat more rigorous than normal to minimize (eliminate) the number of false-positive detections and ensure the normal functioning of other subsystems that work with patches containing detected cars. After successful detection, an image patch containing a detected car had to be cut out and sent for further processing and detection. As the cropped image was smaller than the input image, there was less data to process, so this approach sped up the entire system.
Figure 10, however, shows that even in narrow boxes, there was always some background due to the mismatch between the box and car shapes. The figure also shows how the color of the rear windshield can affect the RGB color histogram of the whole car.
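The thresholding and cropping logic can be sketched as follows. This is a simplified illustration that assumes detections arrive as dictionaries with hypothetical keys `label`, `score`, and `box` (x, y, width, height in pixels) and that the image is a plain list of pixel rows; the actual detector output format may differ.

```python
def crop_confident_cars(image, detections, threshold=0.8):
    """Return image patches for "car" detections scoring at least `threshold`."""
    patches = []
    for det in detections:
        if det["label"] != "car" or det["score"] < threshold:
            continue  # below the threshold: treated as not a car, nothing is cropped
        x, y, w, h = det["box"]
        patches.append([row[x:x + w] for row in image[y:y + h]])
    return patches

# Toy 6x8 "image" whose pixels are just (row, column) pairs.
image = [[(r, c) for c in range(8)] for r in range(6)]
detections = [
    {"label": "car", "score": 0.92, "box": (1, 2, 4, 3)},
    {"label": "car", "score": 0.55, "box": (0, 0, 2, 2)},
]
patches = crop_confident_cars(image, detections)
print(len(patches), len(patches[0]), len(patches[0][0]))
```

Only the patch from the first detection is passed on; the downstream subsystems then work on 12 pixels instead of 48, which is where the speed-up comes from.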

An example of the license plate detection and recognition subsystem operation is shown in Figure 11.
The SSD network was trained for 4 hours, or 8,000 iterations. The loss function for classification was a weighted sigmoid function. The loss decreased and converged with increasing iterations, reaching a final value of 0.1 (Figure 12). After training, the model was ready to use, but it first had to be converted to a format that can be run on the Raspberry Pi.
The system for automatic detection and classification of cars in traffic was tested using three different datasets. Table 4 shows the test results for 1,000 images used for network training. The output percentage shows relatively good network behavior (detection efficiency over 90%), which was the expected result since the output was based on the same images that were used for learning (development) in individual subsystems. System performance was then tested using images that the subsystems did not see during training (test set), so their significance in terms of the achieved results was somewhat higher. The results obtained for 200 such images are shown in Table 5. Readers should note that the tables use accuracy as the measure of system performance (the same method was used in similar papers, e.g., [56]); other metrics, such as intersection over union (IoU), may give different insights into the system's functioning but were not used to report results in this paper.
The results shown in Table 5 match similar systems available in the literature, except for the percentage of successful license plate detections and recognitions, which was slightly lower [57,58]. Consequently, the following section gives a detailed analysis of the latter.
To ensure a correct interpretation of this percentage, several general factors that are unrelated to the algorithm used but affected the presented result should be considered: the quality of the images, the type of license plates for which the system was developed, and the methods of reporting and interpreting results. Additionally, the requirement of computational simplicity for real-time execution (which often involves sacrificing some accuracy) needs to be taken into account, given that the system is itself just one part of a complex overarching architecture.
Given that the proposed system analyzed Croatian license plates, it makes sense to compare it directly with systems from the literature that work with the same type of license plates. Unfortunately, the number of available papers on this subject is limited (e.g., [59][60][61]), as are the possibilities for comparison. For example, Romić et al. [59] reported an accuracy of "about 80%", Henry et al. [60] reported an accuracy of 97%, and Novosel [61] reported an accuracy of 81.05%. However, for the sake of a correct interpretation of these results and their comparison with the results of our study, it should be noted that Romić et al. [59] and Henry et al. [60] used higher-quality images of stationary cars, with a constant distance between car and camera. Due to the uniqueness of Croatian license plates, Henry et al. [60] interpreted the "O" and "0" characters as a single character (i.e., wrong character classifications for the pair were not reported). Novosel [61] used a similar approach, treating not only "O" and "0" but also "1" and "I" as equivalent, as these were impossible to distinguish even by humans at lower resolutions. The reported accuracy of 81.05% was obtained by calculating the total number of characters on all license plates in the images and comparing it with the number of incorrectly detected characters. In contrast, this paper reports the results for accurate detections of the full license plate (without introducing any character equivalence, such as between "O" and "0" or "I" and "1"). Therefore, even in cases where the system accurately detected 7 out of 8 characters on a plate, this was labeled as an incorrect classification (in other words, 0% as opposed to the 87.5% accuracy obtained with the result interpretation method used in Novosel [61]). The proposed system did not use the more complex approaches seen in Henry et al. [60] and was tested on static and dynamic lower-resolution images shot from various angles. Due to all the above, we believe that the achieved result of 51.3% was relatively good, but also that there is room for improvement, primarily by adding a classification neural network (which would, consequently, require more robust hardware).
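The difference between the two reporting methods can be made concrete with a small sketch. The plate strings below are hypothetical, the character comparison is a simple position-by-position match, and the optional merging of "O"/"0" and "I"/"1" mimics the equivalences described for the cited studies rather than reproducing their code.

```python
# Optional character equivalences, as described for some of the cited studies.
EQUIVALENT = str.maketrans({"O": "0", "I": "1"})

def plate_accuracy(predicted, truth, merge_similar=False):
    """Fraction of plates whose full text matches exactly."""
    hits = 0
    for p, t in zip(predicted, truth):
        if merge_similar:
            p, t = p.translate(EQUIVALENT), t.translate(EQUIVALENT)
        hits += p == t
    return hits / len(truth)

def character_accuracy(predicted, truth):
    """Fraction of ground-truth characters matched position by position."""
    correct = sum(a == b for p, t in zip(predicted, truth) for a, b in zip(p, t))
    total = sum(len(t) for t in truth)
    return correct / total

truth = ["ST1234AB"]      # hypothetical 8-character plate
predicted = ["ST1234A8"]  # last character misread
print(plate_accuracy(predicted, truth))      # 0.0
print(character_accuracy(predicted, truth))  # 0.875
```

The same single misread character therefore counts as a complete failure under full-plate scoring but as 87.5% under character-level scoring, which is why the two kinds of figures are not directly comparable.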
Finally, the system was tested on a limited set of images that did not contain cars, to examine its behavior in situations that may arise in real-world applications. The results of this testing are shown in Table 6.
According to the results shown in Table 6, the system behaved appropriately in these scenarios.
Based on the results presented in Tables 3, 4, and 5, we may conclude the following. A YOLO detector was used to detect cars. In tests, the system correctly detected cars in 92.1% of cases on the training set and in 88.5% of cases on the test set. Diminished accuracy on the test set compared to the training set was expected. Since the decline in accuracy was not significant (and the accuracy was comparable), it may be concluded that there was no overfitting of the neural network. An accuracy comparison between parts of the proposed car detection system and similar systems from the literature is given in Table 7. The table also provides additional data on the images and platforms used in the studies to provide more context for the obtained results and their comparison with the proposed system.
Car make and model classification using SSD neural networks yielded an accuracy of 81.4% and 78.5% on the training and test set, respectively. As was the case for car detection, a small decrease in accuracy was also observed here, suggesting (in combination with the losses in Figure 12) that the neural network learned correctly and that there was no overfitting. Visual examples of car make and model detection are shown in Figure 13. To provide some context for the results for this subsystem, we will provide a short comparison with similar systems from the literature (focusing exclusively on detecting car makes and models, not the whole system, as was the case here). In Lee et al. [69], the convolutional neural network used to detect car makes and models had an accuracy of 96.3%. To develop the system, a database with 291,602 images (80% training and 20% testing) and 766 car models (collected during one calendar year) was created using a stationary camera. The system ran on a platform with an i7@2.6GHz processor and GeForce GTX 1080 graphics card.
Interestingly, the results reported in that paper [69] were compared with 7 other state-of-the-art systems from the literature, whose accuracy ranged from 85% to 98.63%.
Ni and Huttunen [70] also provide a brief literature review but make a distinction between make recognition systems (4) and model recognition systems (19 [71] ranged from 75% to 95%, depending on the database used. The authors included 400 available car makes and 7,000 models, but did not specify the images and hardware (video cameras) used for detection. Table 8 shows the confusion matrix using 200 test images for car model classification.
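For readers unfamiliar with confusion matrices, the sketch below shows how overall and per-class accuracy can be derived from one. The 3-class matrix and its labels are hypothetical and do not reproduce the values in Table 8; rows are assumed to hold the true class and columns the predicted class.

```python
def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_accuracy(matrix, labels):
    """Share of each true class (row) that was predicted correctly."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

# Hypothetical counts: rows = true class, columns = predicted class.
labels = ["Model A", "Model B", "Others"]
matrix = [
    [18, 1, 1],
    [2, 16, 2],
    [3, 3, 14],
]
print(overall_accuracy(matrix))
print(per_class_accuracy(matrix, labels))
```

A single weak row, such as a broad catch-all class, lowers the overall figure even when the remaining classes are classified well.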
The biggest challenge for this subsystem was posed by the "Others" class, as it comprised a large number of different car models. Not surprisingly, the results for this class (68.8%) had the most significant negative impact on the overall accuracy of the subsystem.
To detect license plates, we used a method from the OpenCV library (findContours()) to find plates in a binary image using contours. The binary image was generated using the Canny edge detector. A shortcoming of this approach was that it sometimes detected the rear windshield instead of the plate. In addition, the characters on a detected plate were sometimes too small to be recognized. The most common errors involved similar characters, such as the capital letter "I" and the number "1", or conflating the letter "Z" with the numbers "2" or "7". Croatian license plates can also contain letters with diacritics, such as "č", "ć", "đ", "š", and "ž". Detecting these characters on plates was problematic because the glyphs are generally poorly rendered, which makes them hard to recognize even with the naked eye.

Discussion
In this paper, we developed an automatic convolutional neural network-based computer vision system for the detection and classification of cars in traffic [33]. The system is autonomous and automatically detects cars on the road as well as their makes and models, classifies colors of observed objects, and recognizes license plates.
Preconditions for the proper functioning of the system include having a rear-view image of a car and a manageable distance between the camera and the observed car (otherwise, the license plate may be too small for reliable detection and character recognition). The system was implemented on a Raspberry Pi 4 computer and mounted on a car's windshield (on the inside) to observe cars in front of the camera. The system provides quick, straightforward solutions to enable real-time implementation on a Raspberry Pi.
Car detection using the YOLO architecture had a total accuracy of 88.5%. This accuracy is comparable to that of the Nero-APNR system, which has an accuracy of 90% [20,21]. However, the Nero-APNR system can operate at higher shooting speeds and higher car speeds with up to 10 simultaneous detections. It is unknown whether it can classify car models and makes as well as colors, which we believe is a strength of our system, as it facilitates the detection of irregular (e.g., stolen) license plates mounted on cars, which can potentially deceive standard license plate recognition systems (such as Nero-APNR). The proposed system can achieve all of the above with a much simpler setup and at a fraction of the cost [20,21].
The proposed system classifies color and model and detects license plates only in a patch containing a detected car, as opposed to the whole image. Consequently, these steps depend in part on the car detection phase. Car detection accuracy has the greatest repercussions for color classification. More precisely, the potential for incorrect car color classification increases when the bounding box contains a large portion of the surrounding environment. The SSD detector used for car model and make classification achieved a 78.5% accuracy.
The final task was detecting and recognizing characters on license plates. A plate was detected using various filters to find contours. This was a complex task due to blur in images of moving cars as well as Tesseract's difficulty in recognizing characters at greater distances.
Rainy, foggy, and night conditions on the road significantly affect the performance of the system because they degrade the quality of images recorded by the camera [11][12][13]. Additionally, cars or their parts (e.g., license plates) can be dirty even in favorable weather conditions, which may pose a challenge for some of the proposed subsystems. The system has not been tested in these and similar situations, which is a limitation of the present study.
The system was set up on a Raspberry Pi 4, which can process one frame per second (1 fps) from the available real-time video feed. This system speed was sufficient as there were no quick vehicle changes in front of the camera. Higher system speeds may be achieved by replacing the Raspberry Pi computer with a Jetson Nano, as its GPU would significantly speed up the overall system [72]. Although conducted in similar, but not identical settings (hardware + algorithms), a benchmark comparison between the speed of the Tiny YOLO v3 algorithm running on a Raspberry Pi 3 and a Jetson Nano computer confirmed this assumption [73]. More specifically, on a Raspberry Pi computer, this algorithm ran at 0.5 fps, while the Jetson Nano ran at 25 fps [73], which is a 50-fold acceleration. Car detection using the YOLO architecture yielded good results in the proposed system, but other neural network-based architectures may also be used to achieve higher accuracy, although this would slow down the system. The replacement would also improve color classification results. Lastly, another potential area for improvement is replacing the OpenCV-based license plate detection method with a detector (e.g., a neural network-based system [74]) to ensure greater accuracy and better handle more realistic operating conditions, such as a partially dirty license plate. Naturally, the existing plate detection and recognition methods may also be improved and extended by adding image processing algorithms, such as morphological operations to extract plate contours.
Based on the above, it may be concluded that the proposed system works very well in realistic, although idealized, conditions, but it has not been tested in more challenging scenarios that may occur in routine use in traffic (e.g., rain, direct light, etc.). Therefore, such scenarios need to be included in the testing image set to further develop and improve the system in accordance with the obtained results and observed shortcomings. Currently, the system's major shortcomings have to do with plate detection and recognition, especially at higher speeds, which additionally highlights the need to increase the low processing speed of 1 frame per second. However, the promising accuracy and low cost of this automatic system for the detection and classification of cars in traffic lay the foundations for its continuing development, including the development of a neural network for license plate recognition. With some minor modifications, the system could also be used in car parks for automatic ticket payment, mounted on traffic light control systems, as well as in other similar applications.